The aim of this homework is to compare the performance of penalized regression approaches, decision trees, and tree-based ensembles. These four model families are trained on four datasets, and their performance is compared based on test errors.
In this assignment the rpart, glmnet, gbm, rattle, caret, and randomForest packages are used for modeling.
Brief information about the chosen datasets and related links can be found below.
airbnb_data <- fread("~/Desktop/IE582_HW4/airbnb.csv")
forest_data <- fread("~/Desktop/IE582_HW4/forest.csv")
news_data <- fread("~/Desktop/IE582_HW4/news.csv")
HR_data <- fread("~/Desktop/IE582_HW4/HR.csv")
This dataset contains Airbnb listings in London, UK. There are categorical variables such as property type, room type, and location; binary variables such as availability; and numerical variables such as deposit and cleaning fee.
Task: Regression Objective: Predict listing price Variables: 42
The data can be downloaded from this link
This dataset contains tree observations from four areas of the Roosevelt National Forest in Colorado. It includes information on tree type, shadow coverage, distance to nearby landmarks, soil type, local topography, and more. Apart from ten numerical measurements, the predictors are categorical (dummy-encoded 0/1 columns). The documentation lists each attribute's name, type, measurement unit, and a brief description. The target is the forest cover type, an integer label from 1 to 7, so predicting it is a classification problem.
Task: Multi-class Classification Objective: Predict forest type Variables: 55
The data can be downloaded from this link
This dataset includes factors that lead to employee turnover. It has binary features such as gender and marital status, ordinal features such as education level, job involvement, and job level, and numerical features such as monthly income and age.
Task: Binary classification with class imbalance of ratio 5:1 Objective: Predict employee turnover Variables: 35
This dataset summarizes a set of features about online news published by a media platform. The goal is to predict the number of shares in social networks. The dataset has some categorical and binary features such as “week day is Monday?”. It also has numerical features such as number of words.
Task: Regression Objective: Predict popularity Variables: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)
The data and explanation of variable names can be found in this link
Prices are continuous, but according to the histogram the distribution is right-skewed rather than normal.
I removed descriptive variables and variables with zero variance, and I converted numerical variables with fewer than 10 unique values to categorical.
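The zero-variance filter mentioned above is not shown in the code that follows, so here is a minimal base R sketch on a hypothetical data.frame `df` (the report's actual preprocessing uses dplyr):

```r
# Drop columns that have no variance, i.e. only one distinct value.
df <- data.frame(id       = 1:5,
                 constant = rep("x", 5),               # zero variance -> dropped
                 price    = c(10, 20, 20, 35, 50))

keep <- vapply(df, function(col) length(unique(col)) > 1, logical(1))
df_filtered <- df[, keep, drop = FALSE]
```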
colnames(airbnb_data) <- make.names(colnames(airbnb_data))
airbnb_data <- airbnb_data %>%
filter(price_x<500)
airbnb_data_drop <- select(airbnb_data, -c("V1","id","host_location","host_since", "street","security_deposit","cleaning_fee","city", "maximum_nights","is_business_travel_ready","requires_license","cancellation_policy","property_type","bed_type"))
airbnb_data_omit <- na.omit(airbnb_data_drop)
airbnb_data_factors <- airbnb_data_omit %>%
select_if(~n_distinct(.) <= 10)
airbnb_data_fix <- airbnb_data_omit %>%
mutate_at(c("experiences_offered" , "host_is_superhost","host_has_profile_pic", "host_identity_verified", "is_location_exact","room_type_y" , "instant_bookable", "require_guest_profile_picture", "require_guest_phone_verification" ), factor)
skim(airbnb_data_fix)
| Name | airbnb_data_fix |
| Number of rows | 80234 |
| Number of columns | 28 |
| _______________________ | |
| Column type frequency: | |
| factor | 9 |
| numeric | 19 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| experiences_offered | 0 | 1 | FALSE | 5 | non: 78712, bus: 515, fam: 443, soc: 366 |
| host_is_superhost | 0 | 1 | FALSE | 2 | f: 67933, t: 12301 |
| host_has_profile_pic | 0 | 1 | FALSE | 2 | t: 79937, f: 297 |
| host_identity_verified | 0 | 1 | FALSE | 2 | f: 52044, t: 28190 |
| is_location_exact | 0 | 1 | FALSE | 2 | t: 54983, f: 25251 |
| room_type_y | 0 | 1 | FALSE | 4 | Ent: 44639, Pri: 34357, Hot: 654, Sha: 584 |
| instant_bookable | 0 | 1 | FALSE | 2 | f: 46749, t: 33485 |
| require_guest_profile_picture | 0 | 1 | FALSE | 2 | f: 79460, t: 774 |
| require_guest_phone_verification | 0 | 1 | FALSE | 2 | f: 78951, t: 1283 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| price_x | 0 | 1 | 102.23 | 79.11 | 0 | 45 | 80 | 130.0 | 499 | ▇▃▁▁▁ |
| host_listings_count | 0 | 1 | 19.51 | 109.40 | 0 | 1 | 1 | 4.0 | 1687 | ▇▁▁▁▁ |
| accommodates | 0 | 1 | 3.12 | 1.92 | 1 | 2 | 2 | 4.0 | 16 | ▇▂▁▁▁ |
| bathrooms | 0 | 1 | 1.29 | 0.58 | 0 | 1 | 1 | 1.5 | 35 | ▇▁▁▁▁ |
| bedrooms | 0 | 1 | 1.38 | 0.88 | 0 | 1 | 1 | 2.0 | 50 | ▇▁▁▁▁ |
| beds | 0 | 1 | 1.70 | 1.22 | 0 | 1 | 1 | 2.0 | 50 | ▇▁▁▁▁ |
| guests_included | 0 | 1 | 1.57 | 1.25 | 1 | 1 | 1 | 2.0 | 46 | ▇▁▁▁▁ |
| extra_people | 0 | 1 | 7.43 | 14.01 | 0 | 0 | 0 | 10.0 | 247 | ▇▁▁▁▁ |
| minimum_nights_y | 0 | 1 | 4.60 | 19.28 | 1 | 1 | 2 | 3.0 | 1125 | ▇▁▁▁▁ |
| availability_30 | 0 | 1 | 4.84 | 5.50 | 0 | 0 | 0 | 11.0 | 30 | ▇▅▁▁▁ |
| availability_60 | 0 | 1 | 18.18 | 19.37 | 0 | 0 | 3 | 40.0 | 60 | ▇▁▁▆▁ |
| availability_90 | 0 | 1 | 31.99 | 33.09 | 0 | 0 | 14 | 70.0 | 90 | ▇▁▁▅▁ |
| availability_365_y | 0 | 1 | 115.11 | 135.34 | 0 | 0 | 66 | 237.0 | 365 | ▇▁▂▁▃ |
| number_of_reviews_y | 0 | 1 | 17.48 | 37.20 | 0 | 1 | 4 | 18.0 | 775 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1 | 4.88 | 9.78 | 0 | 0 | 1 | 6.0 | 334 | ▇▁▁▁▁ |
| calculated_host_listings_count_y | 0 | 1 | 15.80 | 81.36 | 1 | 1 | 1 | 4.0 | 921 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1 | 12.97 | 79.12 | 0 | 0 | 1 | 2.0 | 919 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1 | 2.17 | 12.89 | 0 | 0 | 1 | 1.0 | 223 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1 | 0.04 | 0.65 | 0 | 0 | 0 | 0.0 | 18 | ▇▁▁▁▁ |
## Classes 'data.table' and 'data.frame': 80234 obs. of 28 variables:
## $ price_x : int 88 65 100 300 150 29 100 195 72 80 ...
## $ experiences_offered : Factor w/ 5 levels "business","family",..: 2 1 4 3 1 3 3 3 3 3 ...
## $ host_is_superhost : Factor w/ 2 levels "f","t": 1 1 1 1 1 2 1 2 2 1 ...
## $ host_listings_count : num 3 4 1 18 3 3 2 1 4 2 ...
## $ host_has_profile_pic : Factor w/ 2 levels "f","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ host_identity_verified : Factor w/ 2 levels "f","t": 2 2 2 2 1 1 2 2 1 1 ...
## $ is_location_exact : Factor w/ 2 levels "f","t": 2 2 2 1 2 2 2 1 1 1 ...
## $ room_type_y : Factor w/ 4 levels "Entire home/apt",..: 1 3 1 1 3 3 3 1 3 1 ...
## $ accommodates : int 4 2 2 6 2 2 2 5 2 2 ...
## $ bathrooms : num 1 1 1 2 1 1.5 1 1.5 0 1 ...
## $ bedrooms : num 1 1 1 3 1 1 1 3 1 1 ...
## $ beds : num 3 0 1 3 1 0 1 3 0 1 ...
## $ guests_included : int 2 1 2 3 1 1 1 1 1 1 ...
## $ extra_people : num 20 15 0 10 0 8 0 0 0 0 ...
## $ minimum_nights_y : int 2 1 10 3 3 10 1 3 2 6 ...
## $ availability_30 : int 12 12 0 9 11 13 13 10 11 11 ...
## $ availability_60 : int 42 42 0 39 41 43 43 40 38 41 ...
## $ availability_90 : int 72 72 13 69 71 73 73 62 54 71 ...
## $ availability_365_y : int 347 347 288 326 346 348 348 334 311 71 ...
## $ number_of_reviews_y : int 192 21 89 42 0 129 6 77 528 52 ...
## $ number_of_reviews_ltm : int 9 5 4 2 0 8 1 10 27 0 ...
## $ instant_bookable : Factor w/ 2 levels "f","t": 2 1 2 2 1 2 1 1 2 1 ...
## $ require_guest_profile_picture : Factor w/ 2 levels "f","t": 1 1 2 1 1 1 1 1 1 1 ...
## $ require_guest_phone_verification : Factor w/ 2 levels "f","t": 2 1 2 1 1 1 1 1 1 1 ...
## $ calculated_host_listings_count_y : int 2 3 1 15 2 3 2 1 4 1 ...
## $ calculated_host_listings_count_entire_homes : int 2 1 1 15 0 0 1 1 0 1 ...
## $ calculated_host_listings_count_private_rooms: int 0 2 0 0 2 3 1 0 4 0 ...
## $ calculated_host_listings_count_shared_rooms : int 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
In this step I split the dataset into training and test sets, keeping 70% of the observations for training and 30% for testing.
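A base R equivalent of this 70/30 split on toy data (hypothetical sizes; the report itself uses caret::createDataPartition, which additionally keeps the outcome distribution similar across the two sets):

```r
set.seed(123)
n <- 100                                        # toy number of rows
train_idx <- sample(n, size = floor(0.7 * n))   # indices of the 70% training part

train_rows <- seq_len(n)[train_idx]             # 70 training rows
test_rows  <- seq_len(n)[-train_idx]            # remaining 30 test rows
```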
airbnb_data_omit2 <- na.omit(airbnb_data_fix)
set.seed(123)  # seed before sampling so the 4000-row subsample is reproducible
airbnb_data_omit2 = airbnb_data_omit2[sample(nrow(airbnb_data_omit2), 4000), ]
set.seed(123)
split_airbnb <- createDataPartition(airbnb_data_omit2$`price_x`, p = .7,
list = FALSE,
times = 1)
airbnb_train <- airbnb_data_omit2[split_airbnb,]
airbnb_test <- airbnb_data_omit2[-split_airbnb,]
airbnb_train_matrix <- data.matrix(airbnb_train[,-c("price_x")])
airbnb_test_matrix <- data.matrix(airbnb_test[,-c("price_x")])
airbnb_target_train <- data.matrix(airbnb_train$price_x)
airbnb_PRA_cv = cv.glmnet(airbnb_train_matrix, airbnb_target_train , family="gaussian")
plot(airbnb_PRA_cv)
## [1] 0.4285138
## [1] 7.66461
I saw that there is no great difference in complexity between the min lambda and the 1se lambda, so I decided to use the 1se lambda (the sparser model) to build the model.
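The one-standard-error rule behind `lambda.1se` can be sketched in base R with a toy CV curve (hypothetical numbers; `cv.glmnet` computes all of this internally):

```r
# Toy cross-validation results over a lambda grid (hypothetical numbers).
lambda   <- c(0.001, 0.01, 0.1, 1, 10)
cv_error <- c(10.2, 10.0, 10.4, 12.0, 15.0)   # mean CV error per lambda
cv_se    <- c(0.5, 0.5, 0.5, 0.6, 0.7)        # standard error per lambda

# lambda.min: the lambda with the smallest mean CV error.
i_min      <- which.min(cv_error)
lambda_min <- lambda[i_min]

# lambda.1se: the largest lambda whose error is within one SE of the minimum,
# giving a sparser (more regularized) model with comparable CV performance.
threshold  <- cv_error[i_min] + cv_se[i_min]
lambda_1se <- max(lambda[cv_error <= threshold])
```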
airbnb_PRA <- glmnet(airbnb_train_matrix, as.matrix(airbnb_train$price_x), family="gaussian", alpha=1, lambda= airbnb_PRA_cv$lambda.1se)
summary(airbnb_PRA)
## Length Class Mode
## a0 1 -none- numeric
## beta 27 dgCMatrix S4
## df 1 -none- numeric
## dim 2 -none- numeric
## lambda 1 -none- numeric
## dev.ratio 1 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 6 -none- call
## nobs 1 -none- numeric
airbnb_tune_grid = expand.grid(min_leaf_obs=seq(2,5,1),complexity=seq(0,0.03,0.01))
folds <- createFolds(1:nrow(airbnb_train), k = 10)
airbnb_cv_data = airbnb_train
airbnb_cv_param = tibble()
airbnb_all_cv_stat = data.table()
set.seed(123)
for(p in 1:nrow(airbnb_tune_grid)) {
airbnb_temp_param = airbnb_tune_grid[p,]
airbnb_temp_result = tibble()
for(i in 1:10){
airbnb_temp_test = airbnb_cv_data[folds[[i]],]
airbnb_temp_train = airbnb_cv_data[-folds[[i]],]
airbnb_temp_fit=rpart(price_x~.,data = airbnb_temp_train, method="anova", control = rpart.control(minbucket = airbnb_temp_param$min_leaf_obs,cp=airbnb_temp_param$complexity))
airbnb_temp_test$Prediction = predict(airbnb_temp_fit,airbnb_temp_test)
airbnb_temp_result=rbind(airbnb_temp_result,airbnb_temp_test)
}
# Pool predictions from all 10 folds (not just the last one). Note that
# exact-match accuracy is a poor metric for a continuous target; an error
# measure such as RMSE would be more informative here.
airbnb_temp_stat = data.table(airbnb_temp_param, Accuracy= sum(airbnb_temp_result$Prediction==airbnb_temp_result$price_x)/nrow(airbnb_temp_result))
print(airbnb_temp_stat)
airbnb_all_cv_stat = rbind(airbnb_all_cv_stat, airbnb_temp_stat)
}
## min_leaf_obs complexity Accuracy
## 1: 2 0 0.007142857
## min_leaf_obs complexity Accuracy
## 1: 3 0 0.007142857
## min_leaf_obs complexity Accuracy
## 1: 4 0 0.007142857
## min_leaf_obs complexity Accuracy
## 1: 5 0 0.003571429
## min_leaf_obs complexity Accuracy
## 1: 2 0.01 0
## min_leaf_obs complexity Accuracy
## 1: 3 0.01 0
## min_leaf_obs complexity Accuracy
## 1: 4 0.01 0
## min_leaf_obs complexity Accuracy
## 1: 5 0.01 0
## min_leaf_obs complexity Accuracy
## 1: 2 0.02 0
## min_leaf_obs complexity Accuracy
## 1: 3 0.02 0
## min_leaf_obs complexity Accuracy
## 1: 4 0.02 0
## min_leaf_obs complexity Accuracy
## 1: 5 0.02 0
## min_leaf_obs complexity Accuracy
## 1: 2 0.03 0
## min_leaf_obs complexity Accuracy
## 1: 3 0.03 0
## min_leaf_obs complexity Accuracy
## 1: 4 0.03 0
## min_leaf_obs complexity Accuracy
## 1: 5 0.03 0
According to the cross-validation results, the best accuracy was obtained with the following parameters. (The accuracies are close to zero because exact-match accuracy is barely meaningful for a continuous target; an error measure such as RMSE would be a more informative selection criterion.)
## min_leaf_obs complexity Accuracy
## 1: 2 0 0.007142857
Then I use the best-performing parameters to build the CART tree.
airbnb_best_DT = airbnb_all_cv_stat[which.max(Accuracy)]
airbnb_DT = rpart(price_x~.,data = airbnb_train, method="anova",control = rpart.control(minbucket = airbnb_best_DT$min_leaf_obs,cp=airbnb_best_DT$complexity))
fancyRpartPlot(airbnb_DT)
The resulting tree is very complex, so there may be overfitting.
For RF we tune only m (mtry), fixing the other parameters at J = 500 trees and a minimum of 5 observations per terminal node.
A common heuristic for mtry is the square root of the number of features. With 27 predictors, sqrt(27) ≈ 5.2, so I cross-validate over mtry values of 5, 6, and 7.
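The heuristic behind this grid, in code (a sketch; 27 is the predictor count after preprocessing):

```r
p <- 27                          # predictors in airbnb_train after preprocessing
m_center <- floor(sqrt(p))       # square-root-of-features heuristic -> 5
mtry_grid <- c(m_center, m_center + 1, m_center + 2)   # cross-validate 5, 6, 7
```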
set.seed(123)
airbnb_m=c(5,6,7)
J=500
min_size_of_terminal_nodes=5
n_folds=10
airbnb_RF_fitControl=trainControl(method = "repeatedcv",
number = n_folds,
repeats = 1, search="random")
airbnb_RF_grid=expand.grid(mtry = airbnb_m)
airbnb_RF_train <- train(price_x~., data = airbnb_train, method = "rf", trControl = airbnb_RF_fitControl, ntree=J, nodesize = min_size_of_terminal_nodes, tuneGrid = airbnb_RF_grid)
print(airbnb_RF_train[["results"]])
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 5 51.21600 0.5960443 33.77423 6.536677 0.06448376 2.846482
## 2 6 51.10840 0.5975845 33.65839 6.441859 0.06435204 2.813055
## 3 7 51.12061 0.5970414 33.69530 6.507759 0.06396738 2.791607
set.seed(123)
airbnb_GBM_control <- trainControl(method='cv', number=5, search='grid')
airbnb_GBM_grid <- expand.grid(interaction.depth = c(2, 3, 5), n.trees = c(50,100), shrinkage = c(0.05,0.1), n.minobsinnode = 10)
print(airbnb_GBM_grid)
## interaction.depth n.trees shrinkage n.minobsinnode
## 1 2 50 0.05 10
## 2 3 50 0.05 10
## 3 5 50 0.05 10
## 4 2 100 0.05 10
## 5 3 100 0.05 10
## 6 5 100 0.05 10
## 7 2 50 0.10 10
## 8 3 50 0.10 10
## 9 5 50 0.10 10
## 10 2 100 0.10 10
## 11 3 100 0.10 10
## 12 5 100 0.10 10
set.seed(123)
airbnb_GBM_train <- train(price_x~.,
data = airbnb_train,
method = 'gbm',
metric = 'RMSE',
trControl=airbnb_GBM_control,
tuneGrid = airbnb_GBM_grid,
verbose=FALSE)
print(airbnb_GBM_train)
## Stochastic Gradient Boosting
##
## 2802 samples
## 27 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 2242, 2242, 2239, 2243, 2242
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.05 2 50 55.58109 0.5470749 37.32496
## 0.05 2 100 53.10155 0.5680287 35.12149
## 0.05 3 50 54.21046 0.5634243 36.24254
## 0.05 3 100 52.03683 0.5837548 34.25918
## 0.05 5 50 53.19897 0.5741560 35.39588
## 0.05 5 100 51.28301 0.5944135 33.59028
## 0.10 2 50 52.87808 0.5722498 35.00836
## 0.10 2 100 51.43884 0.5917479 33.81389
## 0.10 3 50 52.13886 0.5817968 34.26897
## 0.10 3 100 51.08181 0.5974978 33.41614
## 0.10 5 50 51.57281 0.5893508 33.90920
## 0.10 5 100 51.02036 0.5988958 33.39772
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 100, interaction.depth =
## 5, shrinkage = 0.1 and n.minobsinnode = 10.
## var
## accommodates accommodates
## bedrooms bedrooms
## bathrooms bathrooms
## calculated_host_listings_count_entire_homes calculated_host_listings_count_entire_homes
## room_type_yPrivate room room_type_yPrivate room
## host_listings_count host_listings_count
## extra_people extra_people
## calculated_host_listings_count_y calculated_host_listings_count_y
## availability_365_y availability_365_y
## minimum_nights_y minimum_nights_y
## guests_included guests_included
## number_of_reviews_y number_of_reviews_y
## availability_30 availability_30
## beds beds
## availability_90 availability_90
## number_of_reviews_ltm number_of_reviews_ltm
## availability_60 availability_60
## instant_bookablet instant_bookablet
## calculated_host_listings_count_private_rooms calculated_host_listings_count_private_rooms
## room_type_yShared room room_type_yShared room
## is_location_exactt is_location_exactt
## calculated_host_listings_count_shared_rooms calculated_host_listings_count_shared_rooms
## host_is_superhostt host_is_superhostt
## host_identity_verifiedt host_identity_verifiedt
## experiences_offeredfamily experiences_offeredfamily
## experiences_offerednone experiences_offerednone
## experiences_offeredromantic experiences_offeredromantic
## experiences_offeredsocial experiences_offeredsocial
## host_has_profile_pict host_has_profile_pict
## room_type_yHotel room room_type_yHotel room
## require_guest_profile_picturet require_guest_profile_picturet
## require_guest_phone_verificationt require_guest_phone_verificationt
## rel.inf
## accommodates 34.4415463
## bedrooms 11.2423056
## bathrooms 10.3089723
## calculated_host_listings_count_entire_homes 9.3330259
## room_type_yPrivate room 8.3037269
## host_listings_count 5.0822816
## extra_people 3.6024284
## calculated_host_listings_count_y 3.3069958
## availability_365_y 3.1281807
## minimum_nights_y 2.0652280
## guests_included 1.6685606
## number_of_reviews_y 1.3469511
## availability_30 0.9935921
## beds 0.9179146
## availability_90 0.9048866
## number_of_reviews_ltm 0.8121054
## availability_60 0.7607171
## instant_bookablet 0.4349347
## calculated_host_listings_count_private_rooms 0.4181749
## room_type_yShared room 0.2449279
## is_location_exactt 0.2267264
## calculated_host_listings_count_shared_rooms 0.2027996
## host_is_superhostt 0.1391778
## host_identity_verifiedt 0.1138399
## experiences_offeredfamily 0.0000000
## experiences_offerednone 0.0000000
## experiences_offeredromantic 0.0000000
## experiences_offeredsocial 0.0000000
## host_has_profile_pict 0.0000000
## room_type_yHotel room 0.0000000
## require_guest_profile_picturet 0.0000000
## require_guest_phone_verificationt 0.0000000
airbnb_PRA_predictions_train <- predict(airbnb_PRA_cv, airbnb_train_matrix, s='lambda.1se')
airbnb_PRA_predictions_test <- predict(airbnb_PRA_cv, airbnb_test_matrix, s='lambda.1se')
airbnb_PRA_Rsquared_train <- cor(airbnb_PRA_predictions_train, airbnb_train$price_x) ^ 2
airbnb_PRA_Rsquared_test <- cor(airbnb_PRA_predictions_test, airbnb_test$price_x) ^ 2
airbnb_DT_predictions_train <- predict(airbnb_DT, airbnb_train[,-c("price_x")], type="vector")
airbnb_DT_predictions_test <- predict(airbnb_DT, airbnb_test[,-c("price_x")], type="vector")
airbnb_DT_Rsquared_train <- cor(airbnb_DT_predictions_train, airbnb_train$price_x) ^ 2
airbnb_DT_Rsquared_test <- cor(airbnb_DT_predictions_test, airbnb_test$price_x) ^ 2
airbnb_RF_predictions_train <- predict(airbnb_RF_train, airbnb_train, type="raw")
airbnb_RF_predictions_test <- predict(airbnb_RF_train, airbnb_test, type="raw")
airbnb_RF_Rsquared_train <- cor(airbnb_RF_predictions_train , airbnb_train$price_x) ^ 2
airbnb_RF_Rsquared_test <- cor(airbnb_RF_predictions_test, airbnb_test$price_x) ^ 2
airbnb_SGB_predictions_train <- predict(airbnb_GBM_train, airbnb_train, type="raw")
airbnb_SGB_predictions_test <- predict(airbnb_GBM_train, airbnb_test, type="raw")
airbnb_SGB_Rsquared_train <- cor(airbnb_SGB_predictions_train , airbnb_train$price_x) ^ 2
airbnb_SGB_Rsquared_test <- cor(airbnb_SGB_predictions_test, airbnb_test$price_x) ^ 2
airbnb_test_performances <- data.table(airbnb_PRA=airbnb_PRA_Rsquared_test, airbnb_DT_Rsquared_test, airbnb_RF_Rsquared_test, airbnb_SGB_Rsquared_test)
airbnb_train_performances <- data.table(airbnb_PRA=airbnb_PRA_Rsquared_train, airbnb_DT_Rsquared_train, airbnb_RF_Rsquared_train, airbnb_SGB_Rsquared_train)
airbnb_test_performances
## airbnb_PRA.V1 airbnb_DT_Rsquared_test airbnb_RF_Rsquared_test
## 1: 0.4898242 0.3559494 0.5821181
## airbnb_SGB_Rsquared_test
## 1: 0.5923573
airbnb_train_performances
## airbnb_PRA.V1 airbnb_DT_Rsquared_train airbnb_RF_Rsquared_train
## 1: 0.5209606 0.8971872 0.9088639
## airbnb_SGB_Rsquared_train
## 1: 0.6956277
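For reference, the R² values above are squared Pearson correlations between predictions and actuals. A base R sketch of that computation on toy vectors, alongside the alternative 1 − SSE/SST definition, which can differ when predictions are biased:

```r
actual    <- c(100, 150, 80, 220, 60)    # toy prices
predicted <- c(110, 140, 90, 200, 70)    # toy model predictions

# Squared correlation (the measure used in this report).
r2_cor <- cor(predicted, actual)^2

# Coefficient of determination: 1 - SSE/SST.
sse <- sum((actual - predicted)^2)
sst <- sum((actual - mean(actual))^2)
r2_det <- 1 - sse / sst
```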
While I was exploring the data I saw that all variables were stored as integers; the categorical ones (wilderness areas and soil types) are dummy-encoded as 0/1 columns, so I converted them to factors.
colnames(forest_data) <- make.names(colnames(forest_data))
forest_data_factors <- forest_data %>%
select(11:54)
forest_factors_names <- colnames(forest_data_factors)
forest_factors_names
## [1] "Wilderness_Area1" "Wilderness_Area2" "Wilderness_Area3" "Wilderness_Area4"
## [5] "Soil_Type1" "Soil_Type2" "Soil_Type3" "Soil_Type4"
## [9] "Soil_Type5" "Soil_Type6" "Soil_Type7" "Soil_Type8"
## [13] "Soil_Type9" "Soil_Type10" "Soil_Type11" "Soil_Type12"
## [17] "Soil_Type13" "Soil_Type14" "Soil_Type15" "Soil_Type16"
## [21] "Soil_Type17" "Soil_Type18" "Soil_Type19" "Soil_Type20"
## [25] "Soil_Type21" "Soil_Type22" "Soil_Type23" "Soil_Type24"
## [29] "Soil_Type25" "Soil_Type26" "Soil_Type27" "Soil_Type28"
## [33] "Soil_Type29" "Soil_Type30" "Soil_Type31" "Soil_Type32"
## [37] "Soil_Type33" "Soil_Type34" "Soil_Type35" "Soil_Type36"
## [41] "Soil_Type37" "Soil_Type38" "Soil_Type39" "Soil_Type40"
forest_data_fix <- forest_data %>%
mutate_at(c("Wilderness_Area1", "Wilderness_Area2", "Wilderness_Area3" ,"Wilderness_Area4" ,"Soil_Type1", "Soil_Type2", "Soil_Type3", "Soil_Type4", "Soil_Type5", "Soil_Type6", "Soil_Type7", "Soil_Type8" , "Soil_Type9", "Soil_Type10" , "Soil_Type11", "Soil_Type12", "Soil_Type13" ,"Soil_Type14", "Soil_Type15" , "Soil_Type16" , "Soil_Type17", "Soil_Type18", "Soil_Type19", "Soil_Type20" , "Soil_Type21", "Soil_Type22" , "Soil_Type23" ,"Soil_Type24", "Soil_Type25", "Soil_Type26" , "Soil_Type27" , "Soil_Type28" ,"Soil_Type29", "Soil_Type30", "Soil_Type31", "Soil_Type32", "Soil_Type33", "Soil_Type34", "Soil_Type35", "Soil_Type36", "Soil_Type37" , "Soil_Type38", "Soil_Type39", "Soil_Type40", "Cover_Type"), factor)
str(forest_data_fix)
## Classes 'data.table' and 'data.frame': 581012 obs. of 55 variables:
## $ Elevation : int 2596 2590 2804 2785 2595 2579 2606 2605 2617 2612 ...
## $ Aspect : int 51 56 139 155 45 132 45 49 45 59 ...
## $ Slope : int 3 2 9 18 2 6 7 4 9 10 ...
## $ Horizontal_Distance_To_Hydrology : int 258 212 268 242 153 300 270 234 240 247 ...
## $ Vertical_Distance_To_Hydrology : int 0 -6 65 118 -1 -15 5 7 56 11 ...
## $ Horizontal_Distance_To_Roadways : int 510 390 3180 3090 391 67 633 573 666 636 ...
## $ Hillshade_9am : int 221 220 234 238 220 230 222 222 223 228 ...
## $ Hillshade_Noon : int 232 235 238 238 234 237 225 230 221 219 ...
## $ Hillshade_3pm : int 148 151 135 122 150 140 138 144 133 124 ...
## $ Horizontal_Distance_To_Fire_Points: int 6279 6225 6121 6211 6172 6031 6256 6228 6244 6230 ...
## $ Wilderness_Area1 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ Wilderness_Area2 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Wilderness_Area3 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Wilderness_Area4 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type1 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type2 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type3 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type4 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type5 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type6 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type7 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type8 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type9 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type10 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type11 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type12 : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
## $ Soil_Type13 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type14 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type15 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type16 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type17 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type18 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type19 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type20 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type21 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type22 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type23 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type24 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type25 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type26 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type27 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type28 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type29 : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 2 2 2 ...
## $ Soil_Type30 : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
## $ Soil_Type31 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type32 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type33 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type34 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type35 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type36 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type37 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type38 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type39 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil_Type40 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Cover_Type : Factor w/ 7 levels "1","2","3","4",..: 5 5 2 2 5 2 5 5 5 5 ...
## - attr(*, ".internal.selfref")=<externalptr>
| Name | forest_data_fix |
| Number of rows | 581012 |
| Number of columns | 55 |
| _______________________ | |
| Column type frequency: | |
| factor | 45 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Wilderness_Area1 | 0 | 1 | FALSE | 2 | 0: 320216, 1: 260796 |
| Wilderness_Area2 | 0 | 1 | FALSE | 2 | 0: 551128, 1: 29884 |
| Wilderness_Area3 | 0 | 1 | FALSE | 2 | 0: 327648, 1: 253364 |
| Wilderness_Area4 | 0 | 1 | FALSE | 2 | 0: 544044, 1: 36968 |
| Soil_Type1 | 0 | 1 | FALSE | 2 | 0: 577981, 1: 3031 |
| Soil_Type2 | 0 | 1 | FALSE | 2 | 0: 573487, 1: 7525 |
| Soil_Type3 | 0 | 1 | FALSE | 2 | 0: 576189, 1: 4823 |
| Soil_Type4 | 0 | 1 | FALSE | 2 | 0: 568616, 1: 12396 |
| Soil_Type5 | 0 | 1 | FALSE | 2 | 0: 579415, 1: 1597 |
| Soil_Type6 | 0 | 1 | FALSE | 2 | 0: 574437, 1: 6575 |
| Soil_Type7 | 0 | 1 | FALSE | 2 | 0: 580907, 1: 105 |
| Soil_Type8 | 0 | 1 | FALSE | 2 | 0: 580833, 1: 179 |
| Soil_Type9 | 0 | 1 | FALSE | 2 | 0: 579865, 1: 1147 |
| Soil_Type10 | 0 | 1 | FALSE | 2 | 0: 548378, 1: 32634 |
| Soil_Type11 | 0 | 1 | FALSE | 2 | 0: 568602, 1: 12410 |
| Soil_Type12 | 0 | 1 | FALSE | 2 | 0: 551041, 1: 29971 |
| Soil_Type13 | 0 | 1 | FALSE | 2 | 0: 563581, 1: 17431 |
| Soil_Type14 | 0 | 1 | FALSE | 2 | 0: 580413, 1: 599 |
| Soil_Type15 | 0 | 1 | FALSE | 2 | 0: 581009, 1: 3 |
| Soil_Type16 | 0 | 1 | FALSE | 2 | 0: 578167, 1: 2845 |
| Soil_Type17 | 0 | 1 | FALSE | 2 | 0: 577590, 1: 3422 |
| Soil_Type18 | 0 | 1 | FALSE | 2 | 0: 579113, 1: 1899 |
| Soil_Type19 | 0 | 1 | FALSE | 2 | 0: 576991, 1: 4021 |
| Soil_Type20 | 0 | 1 | FALSE | 2 | 0: 571753, 1: 9259 |
| Soil_Type21 | 0 | 1 | FALSE | 2 | 0: 580174, 1: 838 |
| Soil_Type22 | 0 | 1 | FALSE | 2 | 0: 547639, 1: 33373 |
| Soil_Type23 | 0 | 1 | FALSE | 2 | 0: 523260, 1: 57752 |
| Soil_Type24 | 0 | 1 | FALSE | 2 | 0: 559734, 1: 21278 |
| Soil_Type25 | 0 | 1 | FALSE | 2 | 0: 580538, 1: 474 |
| Soil_Type26 | 0 | 1 | FALSE | 2 | 0: 578423, 1: 2589 |
| Soil_Type27 | 0 | 1 | FALSE | 2 | 0: 579926, 1: 1086 |
| Soil_Type28 | 0 | 1 | FALSE | 2 | 0: 580066, 1: 946 |
| Soil_Type29 | 0 | 1 | FALSE | 2 | 0: 465765, 1: 115247 |
| Soil_Type30 | 0 | 1 | FALSE | 2 | 0: 550842, 1: 30170 |
| Soil_Type31 | 0 | 1 | FALSE | 2 | 0: 555346, 1: 25666 |
| Soil_Type32 | 0 | 1 | FALSE | 2 | 0: 528493, 1: 52519 |
| Soil_Type33 | 0 | 1 | FALSE | 2 | 0: 535858, 1: 45154 |
| Soil_Type34 | 0 | 1 | FALSE | 2 | 0: 579401, 1: 1611 |
| Soil_Type35 | 0 | 1 | FALSE | 2 | 0: 579121, 1: 1891 |
| Soil_Type36 | 0 | 1 | FALSE | 2 | 0: 580893, 1: 119 |
| Soil_Type37 | 0 | 1 | FALSE | 2 | 0: 580714, 1: 298 |
| Soil_Type38 | 0 | 1 | FALSE | 2 | 0: 565439, 1: 15573 |
| Soil_Type39 | 0 | 1 | FALSE | 2 | 0: 567206, 1: 13806 |
| Soil_Type40 | 0 | 1 | FALSE | 2 | 0: 572262, 1: 8750 |
| Cover_Type | 0 | 1 | FALSE | 7 | 2: 283301, 1: 211840, 3: 35754, 7: 20510 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Elevation | 0 | 1 | 2959.37 | 279.98 | 1859 | 2809 | 2996 | 3163 | 3858 | ▁▂▇▇▁ |
| Aspect | 0 | 1 | 155.66 | 111.91 | 0 | 58 | 127 | 260 | 360 | ▇▆▃▃▅ |
| Slope | 0 | 1 | 14.10 | 7.49 | 0 | 9 | 13 | 18 | 66 | ▇▆▁▁▁ |
| Horizontal_Distance_To_Hydrology | 0 | 1 | 269.43 | 212.55 | 0 | 108 | 218 | 384 | 1397 | ▇▃▁▁▁ |
| Vertical_Distance_To_Hydrology | 0 | 1 | 46.42 | 58.30 | -173 | 7 | 30 | 69 | 601 | ▁▇▁▁▁ |
| Horizontal_Distance_To_Roadways | 0 | 1 | 2350.15 | 1559.25 | 0 | 1106 | 1997 | 3328 | 7117 | ▇▇▅▂▁ |
| Hillshade_9am | 0 | 1 | 212.15 | 26.77 | 0 | 198 | 218 | 231 | 254 | ▁▁▁▃▇ |
| Hillshade_Noon | 0 | 1 | 223.32 | 19.77 | 0 | 213 | 226 | 237 | 254 | ▁▁▁▁▇ |
| Hillshade_3pm | 0 | 1 | 142.53 | 38.27 | 0 | 119 | 143 | 168 | 254 | ▁▂▇▆▁ |
| Horizontal_Distance_To_Fire_Points | 0 | 1 | 1980.29 | 1324.20 | 0 | 1024 | 1710 | 2550 | 7173 | ▇▇▂▁▁ |
##
## 1 2 3 4 5 6 7
## 211840 283301 35754 2747 9493 17367 20510
When I checked the data with skimr::skim() and str() I saw that there are no missing values. After converting the dummy-encoded numeric columns to factors, the data is clean.
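Instead of listing all 44 column names by hand as in the mutate_at call above, the 0/1 dummy columns can be detected and converted programmatically. A base R sketch on a toy data.frame:

```r
df <- data.frame(Elevation  = c(2596L, 2590L, 2804L),  # genuine numeric column
                 Soil_TypeA = c(0L, 1L, 0L),           # dummy-encoded
                 Soil_TypeB = c(1L, 0L, 0L))           # dummy-encoded

# Treat a column as a dummy if every value is 0 or 1.
is_dummy <- vapply(df, function(col) all(col %in% c(0L, 1L)), logical(1))
df[is_dummy] <- lapply(df[is_dummy], factor)
```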
There are over 500,000 observations in this dataset, and building models on all of them took too long, so I sampled 4,000 rows. After sampling, I set aside 70% of the data for training and 30% for testing.
forest_data_omit <- na.omit(forest_data_fix)
set.seed(12345)  # seed before sampling so the 4000-row subsample is reproducible
forest_data_omit = forest_data_omit[sample(nrow(forest_data_omit), 4000), ]
set.seed(12345)
split_forest <- createDataPartition(forest_data_omit$`Cover_Type`, p = .7,
list = FALSE,
times = 1)
forest_train <- forest_data_omit[split_forest,]
forest_test <- forest_data_omit[-split_forest,]
In the penalized regression approach we need to standardize the features. However, the data contains dummy-encoded columns, and scaling those to a range like [-3, 3] does not really make sense, so I applied standardization only to the numerical variables. The scales still differ slightly, but this is much better than leaving the data unstandardized. I standardized both the training and test data for the penalized regression approach; the tree-based approaches do not need standardized inputs.
forest_train_standard <- forest_train %>%
mutate_if(is.integer, scale)
forest_test_standard <- forest_test %>%
  mutate_if(is.integer, scale)
forest_train_matrix = matrix()
forest_test_matrix = matrix()
forest_target = matrix()
forest_train_matrix <- data.matrix(forest_train_standard[,-c("Cover_Type")])
forest_test_matrix <- data.matrix(forest_test_standard[,-c("Cover_Type")])
forest_target <- data.matrix(forest_train_standard$Cover_Type)
set.seed(123)
forest_PRA_cv = cv.glmnet(forest_train_matrix, forest_target, family="multinomial", nfolds=10)
forest_lambda_min = forest_PRA_cv$lambda.min
forest_lambda_min
## [1] 0.0009328935
## [1] 0.003766107
I saw that there is no great difference in complexity between the min lambda and the 1se lambda, so I decided to use the min lambda to build the model.
The decision tree parameters are tuned using 10-fold cross-validation.
forest_tune_grid = expand.grid(min_leaf_obs=seq(5,15,1),complexity=seq(0,0.3,0.1))
folds <- createFolds(1:nrow(forest_train), k = 10)
forest_cv_data = forest_train
forest_cv_param = tibble()
forest_all_cv_stat = data.table()
set.seed(123)
for(p in 1:nrow(forest_tune_grid)) {
forest_temp_param = forest_tune_grid[p,]
forest_temp_result = tibble()
for(i in 1:10){
forest_temp_test = forest_cv_data[folds[[i]],]
forest_temp_train = forest_cv_data[-folds[[i]],]
forest_temp_fit=rpart(Cover_Type~.,data = forest_temp_train, method="class", control = rpart.control(minbucket = forest_temp_param$min_leaf_obs,cp=forest_temp_param$complexity))
forest_temp_test$Prediction = predict(forest_temp_fit,forest_temp_test,type="class")
forest_temp_result=rbind(forest_temp_result,forest_temp_test)
}
# Pool predictions from all 10 folds (not just the last one).
forest_temp_stat = data.table(forest_temp_param, Accuracy= sum(forest_temp_result$Prediction==forest_temp_result$Cover_Type)/nrow(forest_temp_result))
print(forest_temp_stat)
forest_all_cv_stat = rbind(forest_all_cv_stat, forest_temp_stat)
}
## min_leaf_obs complexity Accuracy
## 1: 5 0 0.7071429
## min_leaf_obs complexity Accuracy
## 1: 6 0 0.7142857
## min_leaf_obs complexity Accuracy
## 1: 7 0 0.7071429
## min_leaf_obs complexity Accuracy
## 1: 8 0 0.7071429
## min_leaf_obs complexity Accuracy
## 1: 9 0 0.6928571
## min_leaf_obs complexity Accuracy
## 1: 10 0 0.6678571
## min_leaf_obs complexity Accuracy
## 1: 11 0 0.6785714
## min_leaf_obs complexity Accuracy
## 1: 12 0 0.6892857
## min_leaf_obs complexity Accuracy
## 1: 13 0 0.6928571
## min_leaf_obs complexity Accuracy
## 1: 14 0 0.7178571
## min_leaf_obs complexity Accuracy
## 1: 15 0 0.7071429
## min_leaf_obs complexity Accuracy
## 1: 5 0.1 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 6 0.1 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 7 0.1 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 8 0.1 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 9 0.1 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 10 0.1 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 11 0.1 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 12 0.1 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 13 0.1 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 14 0.1 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 15 0.1 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 5 0.2 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 6 0.2 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 7 0.2 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 8 0.2 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 9 0.2 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 10 0.2 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 11 0.2 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 12 0.2 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 13 0.2 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 14 0.2 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 15 0.2 0.6607143
## min_leaf_obs complexity Accuracy
## 1: 5 0.3 0.4964286
## min_leaf_obs complexity Accuracy
## 1: 6 0.3 0.4964286
## min_leaf_obs complexity Accuracy
## 1: 7 0.3 0.4964286
## min_leaf_obs complexity Accuracy
## 1: 8 0.3 0.4964286
## min_leaf_obs complexity Accuracy
## 1: 9 0.3 0.4964286
## min_leaf_obs complexity Accuracy
## 1: 10 0.3 0.4964286
## min_leaf_obs complexity Accuracy
## 1: 11 0.3 0.4964286
## min_leaf_obs complexity Accuracy
## 1: 12 0.3 0.4964286
## min_leaf_obs complexity Accuracy
## 1: 13 0.3 0.4964286
## min_leaf_obs complexity Accuracy
## 1: 14 0.3 0.4964286
## min_leaf_obs complexity Accuracy
## 1: 15 0.3 0.4964286
According to the cross-validation results, the best accuracy was obtained with the following parameters.
## min_leaf_obs complexity Accuracy
## 1: 14 0 0.7178571
I then use the best-performing parameters to build the CART tree.
forest_best_DT <- forest_all_cv_stat[which.max(Accuracy)]  # best row from the CV results
forest_DT = rpart(Cover_Type~.,data = forest_train, method="class",control = rpart.control(minbucket = forest_best_DT$min_leaf_obs,cp=forest_best_DT$complexity))
fancyRpartPlot(forest_DT)
As we can see, the generated tree is not too complex.
For RF we tune only m (mtry), fixing the other parameters at J = 500 trees and a minimum of 5 observations per terminal node.
The square root of the number of features is the usual starting point for mtry. With 54 predictors (55 columns minus the target), sqrt(54) ≈ 7.3, so I cross-validate over 7, 8 and 9.
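The 7, 8, 9 grid follows the sqrt(p) rule of thumb; a minimal base-R sketch, assuming p = 54 predictors (the variable name `forest_m_candidates` is introduced here for illustration only):

```r
# sqrt(p) rule of thumb for mtry: search the integers around floor(sqrt(p))
p <- 54                      # predictors in the forest data (55 columns minus the target)
m_center <- floor(sqrt(p))   # sqrt(54) is about 7.35, so m_center is 7
forest_m_candidates <- m_center + 0:2
forest_m_candidates          # 7 8 9
```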
set.seed(123)
forest_m=c(7,8,9)
J=500
min_size_of_terminal_nodes=5
n_folds=10
forest_RF_fitControl=trainControl(method = "repeatedcv",
number = n_folds,
repeats = 1, classProbs = TRUE, search="random")
forest_RF_grid=expand.grid(mtry = forest_m)
forest_RF_train <- train(make.names(as.factor(Cover_Type)) ~ ., data = forest_train, method = "rf", trControl = forest_RF_fitControl, ntree=J, nodesize = min_size_of_terminal_nodes, tuneGrid = forest_RF_grid)
print(forest_RF_train[["results"]])
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 7 0.7393034 0.5715715 0.02120008 0.03526393
## 2 8 0.7453787 0.5820367 0.02250172 0.03686341
## 3 9 0.7500178 0.5903403 0.02248470 0.03678760
set.seed(123)
forest_GBM_Control <- trainControl(method='cv',
number=5,
search='grid')
forest_GBM_Grid <- expand.grid(interaction.depth = c(2, 3, 5),
n.trees = c(50,100),
shrinkage = c(0.05, 0.1),
n.minobsinnode = 10)
print(forest_GBM_Grid)
## interaction.depth n.trees shrinkage n.minobsinnode
## 1 2 50 0.05 10
## 2 3 50 0.05 10
## 3 5 50 0.05 10
## 4 2 100 0.05 10
## 5 3 100 0.05 10
## 6 5 100 0.05 10
## 7 2 50 0.10 10
## 8 3 50 0.10 10
## 9 5 50 0.10 10
## 10 2 100 0.10 10
## 11 3 100 0.10 10
## 12 5 100 0.10 10
forest_GBM_train <- train(Cover_Type ~ .,
data =forest_train,
method = 'gbm',
metric = 'Accuracy',
trControl=forest_GBM_Control,
tuneGrid = forest_GBM_Grid,
verbose=FALSE)
print(forest_GBM_train)
## Stochastic Gradient Boosting
##
## 2804 samples
## 54 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 2243, 2245, 2244, 2243, 2241
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees Accuracy Kappa
## 0.05 2 50 0.7072362 0.5156248
## 0.05 2 100 0.7243537 0.5483239
## 0.05 3 50 0.7115016 0.5252034
## 0.05 3 100 0.7189895 0.5406846
## 0.05 5 50 0.7232645 0.5452101
## 0.05 5 100 0.7328991 0.5649700
## 0.10 2 50 0.7222058 0.5453391
## 0.10 2 100 0.7243511 0.5504780
## 0.10 3 50 0.7232714 0.5493122
## 0.10 3 100 0.7257631 0.5543864
## 0.10 5 50 0.7257543 0.5524149
## 0.10 5 100 0.7303844 0.5619303
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
## 5, shrinkage = 0.05 and n.minobsinnode = 10.
## var
## Elevation Elevation
## Horizontal_Distance_To_Roadways Horizontal_Distance_To_Roadways
## Horizontal_Distance_To_Hydrology Horizontal_Distance_To_Hydrology
## Horizontal_Distance_To_Fire_Points Horizontal_Distance_To_Fire_Points
## Hillshade_Noon Hillshade_Noon
## Hillshade_9am Hillshade_9am
## Vertical_Distance_To_Hydrology Vertical_Distance_To_Hydrology
## Hillshade_3pm Hillshade_3pm
## Aspect Aspect
## Slope Slope
## Soil_Type41 Soil_Type41
## Soil_Type221 Soil_Type221
## Soil_Type21 Soil_Type21
## Wilderness_Area11 Wilderness_Area11
## Wilderness_Area31 Wilderness_Area31
## Soil_Type231 Soil_Type231
## Soil_Type381 Soil_Type381
## Soil_Type101 Soil_Type101
## Soil_Type391 Soil_Type391
## Soil_Type171 Soil_Type171
## Soil_Type201 Soil_Type201
## Soil_Type121 Soil_Type121
## Soil_Type331 Soil_Type331
## Soil_Type321 Soil_Type321
## Soil_Type131 Soil_Type131
## Soil_Type301 Soil_Type301
## Soil_Type291 Soil_Type291
## Soil_Type61 Soil_Type61
## Soil_Type31 Soil_Type31
## Soil_Type401 Soil_Type401
## Soil_Type111 Soil_Type111
## Wilderness_Area41 Wilderness_Area41
## Wilderness_Area21 Wilderness_Area21
## Soil_Type311 Soil_Type311
## Soil_Type241 Soil_Type241
## Soil_Type11 Soil_Type11
## Soil_Type51 Soil_Type51
## Soil_Type71 Soil_Type71
## Soil_Type81 Soil_Type81
## Soil_Type91 Soil_Type91
## Soil_Type141 Soil_Type141
## Soil_Type151 Soil_Type151
## Soil_Type161 Soil_Type161
## Soil_Type181 Soil_Type181
## Soil_Type191 Soil_Type191
## Soil_Type211 Soil_Type211
## Soil_Type251 Soil_Type251
## Soil_Type261 Soil_Type261
## Soil_Type271 Soil_Type271
## Soil_Type281 Soil_Type281
## Soil_Type341 Soil_Type341
## Soil_Type351 Soil_Type351
## Soil_Type361 Soil_Type361
## Soil_Type371 Soil_Type371
## rel.inf
## Elevation 49.29781390
## Horizontal_Distance_To_Roadways 7.68576519
## Horizontal_Distance_To_Hydrology 5.37036421
## Horizontal_Distance_To_Fire_Points 5.33723021
## Hillshade_Noon 4.30047688
## Hillshade_9am 3.78463079
## Vertical_Distance_To_Hydrology 3.71051014
## Hillshade_3pm 2.60356599
## Aspect 2.45809099
## Slope 2.13396677
## Soil_Type41 1.96946027
## Soil_Type221 1.49068372
## Soil_Type21 1.30102109
## Wilderness_Area11 1.29960806
## Wilderness_Area31 1.16227898
## Soil_Type231 0.76412633
## Soil_Type381 0.71402480
## Soil_Type101 0.63877705
## Soil_Type391 0.48653901
## Soil_Type171 0.37445306
## Soil_Type201 0.36759796
## Soil_Type121 0.33314054
## Soil_Type331 0.33289954
## Soil_Type321 0.32347011
## Soil_Type131 0.31590870
## Soil_Type301 0.27391148
## Soil_Type291 0.24281779
## Soil_Type61 0.17044218
## Soil_Type31 0.16585563
## Soil_Type401 0.14155309
## Soil_Type111 0.13348599
## Wilderness_Area41 0.13202062
## Wilderness_Area21 0.10468099
## Soil_Type311 0.05241909
## Soil_Type241 0.02640886
## Soil_Type11 0.00000000
## Soil_Type51 0.00000000
## Soil_Type71 0.00000000
## Soil_Type81 0.00000000
## Soil_Type91 0.00000000
## Soil_Type141 0.00000000
## Soil_Type151 0.00000000
## Soil_Type161 0.00000000
## Soil_Type181 0.00000000
## Soil_Type191 0.00000000
## Soil_Type211 0.00000000
## Soil_Type251 0.00000000
## Soil_Type261 0.00000000
## Soil_Type271 0.00000000
## Soil_Type281 0.00000000
## Soil_Type341 0.00000000
## Soil_Type351 0.00000000
## Soil_Type361 0.00000000
## Soil_Type371 0.00000000
prediction_forest_PRA_test <- predict(forest_PRA_cv, newx=forest_test_matrix, s=c("lambda.min"), type="class")
prediction_forest_PRA_train <- predict(forest_PRA_cv, newx=forest_train_matrix, s=c("lambda.min"), type="class")
cm_forest_PRA_test <- confusionMatrix(factor(prediction_forest_PRA_test), factor(forest_test$Cover_Type))
cm_forest_PRA_train <- confusionMatrix(factor(prediction_forest_PRA_train), factor(forest_train$Cover_Type))
prediction_forest_DT_test <- predict(forest_DT, forest_test, type="class")
prediction_forest_DT_train <- predict(forest_DT, forest_train, type="class")
cm_forest_DT_test <- confusionMatrix(factor(prediction_forest_DT_test), factor(forest_test$Cover_Type))
cm_forest_DT_train <- confusionMatrix(factor(prediction_forest_DT_train), factor(forest_train$Cover_Type))
prediction_forest_RF_test <- predict(forest_RF_train, forest_test, type="raw")
prediction_forest_RF_train <- predict(forest_RF_train, forest_train, type="raw")
cm_forest_RF_test <- confusionMatrix(factor(prediction_forest_RF_test), factor(make.names(forest_test$Cover_Type)))
cm_forest_RF_train <- confusionMatrix(factor(prediction_forest_RF_train), factor(make.names(forest_train$Cover_Type)))
prediction_forest_SGB_test <- predict(forest_GBM_train, forest_test, type="raw")
prediction_forest_SGB_train <- predict(forest_GBM_train, forest_train, type="raw")
cm_forest_SGB_test <- confusionMatrix(factor(make.names(prediction_forest_SGB_test)), factor(make.names(forest_test$Cover_Type)))
cm_forest_SGB_train <- confusionMatrix(factor(make.names(prediction_forest_SGB_train)), factor(make.names(forest_train$Cover_Type)))
forest_test_performances <- data.table(PRA_test=cm_forest_PRA_test$overall["Accuracy"], DT_test=cm_forest_DT_test$overall["Accuracy"], RF_test=cm_forest_RF_test$overall["Accuracy"], SGB_test=cm_forest_SGB_test$overall["Accuracy"])
forest_train_performances <- data.table(PRA_train=cm_forest_PRA_train$overall["Accuracy"], DT_train=cm_forest_DT_train$overall["Accuracy"], RF_train=cm_forest_RF_train$overall["Accuracy"], SGB_train=cm_forest_SGB_train$overall["Accuracy"])
forest_test_performances
## PRA_test DT_test RF_test SGB_test
## 1: 0.7006689 0.7073579 0.7399666 0.7249164
## PRA_train DT_train RF_train SGB_train
## 1: 0.735378 0.767475 0.9432953 0.8252496
The HR dataset poses a binary classification problem with a roughly 5:1 class imbalance.
When I checked the data with lapply(HR_data, unique), I saw that the EmployeeCount, Over18 and StandardHours variables each had only one unique value. Since these features carry no variation, I removed them.
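This constant-column check can also be scripted; a base-R sketch on a toy frame (`toy` is a hypothetical stand-in for HR_data, not the real data):

```r
# Find columns with a single unique value; these carry no information
toy <- data.frame(EmployeeCount = rep(1, 5),
                  Age           = c(41, 49, 37, 33, 27),
                  Over18        = rep("Y", 5),
                  StandardHours = rep(80, 5))
constant_cols <- names(toy)[sapply(toy, function(x) length(unique(x)) == 1)]
constant_cols  # "EmployeeCount" "Over18" "StandardHours"
```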
colnames(HR_data) <- make.names(colnames(HR_data))
HR_data<- HR_data %>%
mutate_if(is.character, factor)
HR_data_drop <- select(HR_data, -c("EmployeeCount", "Over18", "StandardHours"))
While exploring the data I noticed that some categorical variables were stored as integers even though they had only 2-3 levels. I therefore filtered those variables and converted them to factors.
## [1] "Attrition" "BusinessTravel"
## [3] "Department" "Education"
## [5] "EducationField" "EnvironmentSatisfaction"
## [7] "Gender" "JobInvolvement"
## [9] "JobLevel" "JobRole"
## [11] "JobSatisfaction" "MaritalStatus"
## [13] "OverTime" "PerformanceRating"
## [15] "RelationshipSatisfaction" "StockOptionLevel"
## [17] "TrainingTimesLastYear" "WorkLifeBalance"
HR_data_factors_fix <- HR_data_drop %>%
select_if(is.integer)
HR_data_fix <- HR_data_drop %>%
mutate_at(c( "Education", "EnvironmentSatisfaction", "JobInvolvement", "JobLevel", "JobSatisfaction", "PerformanceRating", "RelationshipSatisfaction", "StockOptionLevel", "TrainingTimesLastYear", "WorkLifeBalance"), factor)
str(HR_data_fix)
## Classes 'data.table' and 'data.frame': 1470 obs. of 32 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : Factor w/ 5 levels "1","2","3","4",..: 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : Factor w/ 4 levels "1","2","3","4": 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : Factor w/ 5 levels "1","2","3","4",..: 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : Factor w/ 4 levels "1","2","3","4": 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : Factor w/ 2 levels "3","4": 1 2 1 1 1 1 2 2 2 1 ...
## $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 1 4 2 3 4 3 1 2 2 2 ...
## $ StockOptionLevel : Factor w/ 4 levels "0","1","2","3": 1 2 1 1 2 1 4 2 1 3 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : Factor w/ 7 levels "0","1","2","3",..: 1 4 4 4 4 3 4 3 3 4 ...
## $ WorkLifeBalance : Factor w/ 4 levels "1","2","3","4": 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
## - attr(*, ".internal.selfref")=<externalptr>
| Name | HR_data_fix |
| Number of rows | 1470 |
| Number of columns | 32 |
| _______________________ | |
| Column type frequency: | |
| factor | 18 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Attrition | 0 | 1 | FALSE | 2 | No: 1233, Yes: 237 |
| BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 1043, Tra: 277, Non: 150 |
| Department | 0 | 1 | FALSE | 3 | Res: 961, Sal: 446, Hum: 63 |
| Education | 0 | 1 | FALSE | 5 | 3: 572, 4: 398, 2: 282, 1: 170 |
| EducationField | 0 | 1 | FALSE | 6 | Lif: 606, Med: 464, Mar: 159, Tec: 132 |
| EnvironmentSatisfaction | 0 | 1 | FALSE | 4 | 3: 453, 4: 446, 2: 287, 1: 284 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 882, Fem: 588 |
| JobInvolvement | 0 | 1 | FALSE | 4 | 3: 868, 2: 375, 4: 144, 1: 83 |
| JobLevel | 0 | 1 | FALSE | 5 | 1: 543, 2: 534, 3: 218, 4: 106 |
| JobRole | 0 | 1 | FALSE | 9 | Sal: 326, Res: 292, Lab: 259, Man: 145 |
| JobSatisfaction | 0 | 1 | FALSE | 4 | 4: 459, 3: 442, 1: 289, 2: 280 |
| MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 673, Sin: 470, Div: 327 |
| OverTime | 0 | 1 | FALSE | 2 | No: 1054, Yes: 416 |
| PerformanceRating | 0 | 1 | FALSE | 2 | 3: 1244, 4: 226 |
| RelationshipSatisfaction | 0 | 1 | FALSE | 4 | 3: 459, 4: 432, 2: 303, 1: 276 |
| StockOptionLevel | 0 | 1 | FALSE | 4 | 0: 631, 1: 596, 2: 158, 3: 85 |
| TrainingTimesLastYear | 0 | 1 | FALSE | 7 | 2: 547, 3: 491, 4: 123, 5: 119 |
| WorkLifeBalance | 0 | 1 | FALSE | 4 | 3: 893, 2: 344, 4: 153, 1: 80 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 36.92 | 9.14 | 18 | 30.00 | 36.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 802.49 | 403.51 | 102 | 465.00 | 802.0 | 1157.00 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.19 | 8.11 | 1 | 2.00 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| EmployeeNumber | 0 | 1 | 1024.87 | 602.02 | 1 | 491.25 | 1020.5 | 1555.75 | 2068 | ▇▇▇▇▇ |
| HourlyRate | 0 | 1 | 65.89 | 20.33 | 30 | 48.00 | 66.0 | 83.75 | 100 | ▇▇▇▇▇ |
| MonthlyIncome | 0 | 1 | 6502.93 | 4707.96 | 1009 | 2911.00 | 4919.0 | 8379.00 | 19999 | ▇▅▂▁▂ |
| MonthlyRate | 0 | 1 | 14313.10 | 7117.79 | 2094 | 8047.00 | 14235.5 | 20461.50 | 26999 | ▇▇▇▇▇ |
| NumCompaniesWorked | 0 | 1 | 2.69 | 2.50 | 0 | 1.00 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.21 | 3.66 | 11 | 12.00 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| TotalWorkingYears | 0 | 1 | 11.28 | 7.78 | 0 | 6.00 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| YearsAtCompany | 0 | 1 | 7.01 | 6.13 | 0 | 3.00 | 5.0 | 9.00 | 40 | ▇▂▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.23 | 3.62 | 0 | 2.00 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.19 | 3.22 | 0 | 0.00 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.12 | 3.57 | 0 | 2.00 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |
The penalized regression approach requires standardization, but the data includes binary variables, so I applied standardization only to the numerical columns. I standardized both the train and test sets to be used in the penalized regression approach; the tree-based approaches do not need the standardized versions.
HR_train_standard <- HR_train %>%
mutate_if(is.numeric, scale)
HR_test_standard <- HR_test %>%
mutate_if(is.numeric, scale)
HR_train_matrix = matrix()
HR_test_matrix = matrix()
HR_target = matrix()
HR_train_matrix <- data.matrix(HR_train_standard[,-c("Attrition")])
HR_test_matrix <- data.matrix(HR_test_standard[,-c("Attrition")])
HR_target <- data.matrix (HR_train_standard$Attrition)
HR_target_test <- data.matrix (HR_test_standard$Attrition)
set.seed(123)
HR_PRA_cv = cv.glmnet(HR_train_matrix, HR_target, family="binomial", nfolds=5)
plot(HR_PRA_cv)
## [1] 0.003658811
## [1] 0.01477068
HR_PRA <- glmnet(HR_train_matrix, HR_target,family="binomial", alpha=1,lambda=HR_PRA_cv$lambda.min)
summary(HR_PRA)
## Length Class Mode
## a0 1 -none- numeric
## beta 31 dgCMatrix S4
## df 1 -none- numeric
## dim 2 -none- numeric
## lambda 1 -none- numeric
## dev.ratio 1 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## classnames 2 -none- character
## call 6 -none- call
## nobs 1 -none- numeric
Tuning the decision tree with 10-fold cross-validation:
HR_tune_grid = expand.grid(min_leaf_obs=seq(3,10,1),complexity=seq(0,0.3,0.1))
HR_folds <- createFolds(1:nrow(HR_train), k = 10)
HR_cv_data = HR_train
HR_cv_param = tibble()
HR_all_cv_stat = data.table()
for(p in 1:nrow(HR_tune_grid)) {
HR_temp_param = HR_tune_grid[p,]
HR_temp_result = tibble()
for(i in 1:10){
HR_temp_test = HR_cv_data[HR_folds[[i]],]
HR_temp_train = HR_cv_data[-HR_folds[[i]],]
HR_temp_fit=rpart(Attrition~.,data = HR_temp_train, method="class", control = rpart.control(minbucket = HR_temp_param$min_leaf_obs,cp=HR_temp_param$complexity))
HR_temp_test$Prediction = predict(HR_temp_fit, HR_temp_test, type="class")
HR_temp_result=rbind( HR_temp_result, HR_temp_test)
}
HR_temp_stat = data.table(HR_temp_param, Accuracy = sum(HR_temp_result$Prediction == HR_temp_result$Attrition) / nrow(HR_temp_result))  # accuracy over all folds, not just the last one
print(HR_temp_stat)
HR_all_cv_stat = rbind(HR_all_cv_stat, HR_temp_stat)
}
## min_leaf_obs complexity Accuracy
## 1: 3 0 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 4 0 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 5 0 0.8365385
## min_leaf_obs complexity Accuracy
## 1: 6 0 0.8173077
## min_leaf_obs complexity Accuracy
## 1: 7 0 0.7788462
## min_leaf_obs complexity Accuracy
## 1: 8 0 0.7884615
## min_leaf_obs complexity Accuracy
## 1: 9 0 0.7980769
## min_leaf_obs complexity Accuracy
## 1: 10 0 0.7980769
## min_leaf_obs complexity Accuracy
## 1: 3 0.1 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 4 0.1 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 5 0.1 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 6 0.1 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 7 0.1 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 8 0.1 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 9 0.1 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 10 0.1 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 3 0.2 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 4 0.2 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 5 0.2 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 6 0.2 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 7 0.2 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 8 0.2 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 9 0.2 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 10 0.2 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 3 0.3 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 4 0.3 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 5 0.3 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 6 0.3 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 7 0.3 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 8 0.3 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 9 0.3 0.8269231
## min_leaf_obs complexity Accuracy
## 1: 10 0.3 0.8269231
According to the cross-validation results, the best accuracy was obtained with the following parameters.
## min_leaf_obs complexity Accuracy
## 1: 5 0 0.8365385
I then use the best-performing parameters to build the CART tree.
HR_best_DT <- HR_all_cv_stat[which.max(Accuracy)]  # best row from the CV results
HR_DT = rpart(Attrition~. , data = HR_train, method="class", control = rpart.control(minbucket = HR_best_DT$min_leaf_obs,cp=HR_best_DT$complexity))
fancyRpartPlot(HR_DT)
As we can see, the generated tree is not too complex.
For RF we tune only m (mtry), fixing the other parameters at J = 500 trees and a minimum of 5 observations per terminal node.
The square root of the number of features is the usual starting point for mtry. With 31 predictors (32 columns minus the target), sqrt(31) ≈ 5.6, so I cross-validate over 5, 6 and 7.
set.seed(123)
HR_m=c(5,6,7)
J=500
min_size_of_terminal_nodes=5
n_folds=10
HR_RF_fitControl=trainControl(method = "repeatedcv",
number = n_folds,
repeats = 1, classProbs = TRUE, search="random")
HR_RF_grid=expand.grid(mtry = HR_m)
HR_RF_train <- train(Attrition~., data = HR_train, method = "rf", trControl = HR_RF_fitControl, ntree=J, nodesize = min_size_of_terminal_nodes, tuneGrid = HR_RF_grid)
print(HR_RF_train[["results"]])
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 5 0.8514506 0.1446482 0.01380373 0.11599747
## 2 6 0.8524310 0.1619882 0.01349232 0.09460159
## 3 7 0.8524215 0.1590154 0.01763513 0.13454714
set.seed(123)
HR_GBM_control <- trainControl(method='cv', number=5, search='grid', summaryFunction = twoClassSummary, classProbs = T)
HR_GBM_grid <- expand.grid(interaction.depth = c(2, 3, 5), n.trees = c(50,100), shrinkage = c(0.05,0.1), n.minobsinnode = 10)
print(HR_GBM_grid)
## interaction.depth n.trees shrinkage n.minobsinnode
## 1 2 50 0.05 10
## 2 3 50 0.05 10
## 3 5 50 0.05 10
## 4 2 100 0.05 10
## 5 3 100 0.05 10
## 6 5 100 0.05 10
## 7 2 50 0.10 10
## 8 3 50 0.10 10
## 9 5 50 0.10 10
## 10 2 100 0.10 10
## 11 3 100 0.10 10
## 12 5 100 0.10 10
set.seed(123)
HR_GBM_train <- train(Attrition~.,
data = HR_train,
method = 'gbm',
metric = 'ROC',
trControl=HR_GBM_control,
tuneGrid = HR_GBM_grid,
verbose=FALSE)
print(HR_GBM_train)
## Stochastic Gradient Boosting
##
## 1030 samples
## 31 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 824, 823, 824, 825, 824
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees ROC Sens Spec
## 0.05 2 50 0.7404624 0.9918941 0.1270945
## 0.05 2 100 0.7588339 0.9861137 0.1812834
## 0.05 3 50 0.7496150 0.9872631 0.1513369
## 0.05 3 100 0.7575291 0.9768450 0.1992870
## 0.05 5 50 0.7496994 0.9895819 0.1511586
## 0.05 5 100 0.7629605 0.9803199 0.2051693
## 0.10 2 50 0.7617151 0.9861003 0.1869875
## 0.10 2 100 0.7743626 0.9722140 0.2475936
## 0.10 3 50 0.7598030 0.9745396 0.1934046
## 0.10 3 100 0.7758482 0.9641215 0.2475936
## 0.10 5 50 0.7659259 0.9675965 0.2413547
## 0.10 5 100 0.7713414 0.9687458 0.2955437
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
## var rel.inf
## MonthlyIncome MonthlyIncome 12.7869559
## OverTimeYes OverTimeYes 10.0230755
## Age Age 7.2209250
## DailyRate DailyRate 7.2082878
## HourlyRate HourlyRate 4.9710056
## MaritalStatusSingle MaritalStatusSingle 4.9515005
## DistanceFromHome DistanceFromHome 4.5041845
## EmployeeNumber EmployeeNumber 4.3631242
## BusinessTravelTravel_Frequently BusinessTravelTravel_Frequently 3.5621742
## TotalWorkingYears TotalWorkingYears 3.4584936
## PercentSalaryHike PercentSalaryHike 3.0576645
## YearsWithCurrManager YearsWithCurrManager 2.6995844
## YearsAtCompany YearsAtCompany 2.6825570
## YearsInCurrentRole YearsInCurrentRole 2.3647025
## WorkLifeBalance3 WorkLifeBalance3 1.9709473
## NumCompaniesWorked NumCompaniesWorked 1.9073885
## YearsSinceLastPromotion YearsSinceLastPromotion 1.9035499
## MonthlyRate MonthlyRate 1.7543968
## JobRoleSales Executive JobRoleSales Executive 1.7015673
## EnvironmentSatisfaction4 EnvironmentSatisfaction4 1.6908300
## JobSatisfaction4 JobSatisfaction4 1.4920081
## StockOptionLevel1 StockOptionLevel1 1.4841186
## DepartmentResearch & Development DepartmentResearch & Development 1.2983130
## JobRoleResearch Scientist JobRoleResearch Scientist 1.2192456
## RelationshipSatisfaction4 RelationshipSatisfaction4 0.9322413
## EducationFieldMarketing EducationFieldMarketing 0.8175701
## JobRoleLaboratory Technician JobRoleLaboratory Technician 0.8074269
## EducationFieldTechnical Degree EducationFieldTechnical Degree 0.6649801
## EducationFieldMedical EducationFieldMedical 0.6172724
## JobLevel3 JobLevel3 0.5573757
## GenderMale GenderMale 0.5007384
## EnvironmentSatisfaction3 EnvironmentSatisfaction3 0.4701174
## JobInvolvement3 JobInvolvement3 0.4306623
## WorkLifeBalance2 WorkLifeBalance2 0.4131238
## JobSatisfaction2 JobSatisfaction2 0.3842745
## RelationshipSatisfaction3 RelationshipSatisfaction3 0.3823399
## JobRoleSales Representative JobRoleSales Representative 0.3621029
## JobLevel2 JobLevel2 0.3486336
## TrainingTimesLastYear2 TrainingTimesLastYear2 0.2667205
## JobRoleHuman Resources JobRoleHuman Resources 0.2589413
## StockOptionLevel3 StockOptionLevel3 0.2566517
## JobInvolvement2 JobInvolvement2 0.2425063
## Education3 Education3 0.2375658
## RelationshipSatisfaction2 RelationshipSatisfaction2 0.1873844
## StockOptionLevel2 StockOptionLevel2 0.1761193
## JobInvolvement4 JobInvolvement4 0.1486218
## TrainingTimesLastYear5 TrainingTimesLastYear5 0.1420719
## TrainingTimesLastYear6 TrainingTimesLastYear6 0.1179573
## BusinessTravelTravel_Rarely BusinessTravelTravel_Rarely 0.0000000
## DepartmentSales DepartmentSales 0.0000000
## Education2 Education2 0.0000000
## Education4 Education4 0.0000000
## Education5 Education5 0.0000000
## EducationFieldLife Sciences EducationFieldLife Sciences 0.0000000
## EducationFieldOther EducationFieldOther 0.0000000
## EnvironmentSatisfaction2 EnvironmentSatisfaction2 0.0000000
## JobLevel4 JobLevel4 0.0000000
## JobLevel5 JobLevel5 0.0000000
## JobRoleManager JobRoleManager 0.0000000
## JobRoleManufacturing Director JobRoleManufacturing Director 0.0000000
## JobRoleResearch Director JobRoleResearch Director 0.0000000
## JobSatisfaction3 JobSatisfaction3 0.0000000
## MaritalStatusMarried MaritalStatusMarried 0.0000000
## PerformanceRating4 PerformanceRating4 0.0000000
## TrainingTimesLastYear1 TrainingTimesLastYear1 0.0000000
## TrainingTimesLastYear3 TrainingTimesLastYear3 0.0000000
## TrainingTimesLastYear4 TrainingTimesLastYear4 0.0000000
## WorkLifeBalance4 WorkLifeBalance4 0.0000000
prediction_HR_PRA_test <- predict(HR_PRA, newx=HR_test_matrix, type="class")  # HR_PRA was already fit at lambda.min, so no s argument is needed
prediction_HR_PRA_train <- predict(HR_PRA, newx=HR_train_matrix, type="class")
cm_HR_PRA_test <- confusionMatrix(factor(prediction_HR_PRA_test), factor(HR_test$Attrition))
cm_HR_PRA_train <- confusionMatrix(factor(prediction_HR_PRA_train), factor(HR_train$Attrition))
prediction_HR_DT_test <- predict(HR_DT, HR_test, type="class")
prediction_HR_DT_train <- predict(HR_DT, HR_train, type="class")
cm_HR_DT_test <- confusionMatrix(factor(prediction_HR_DT_test), factor(HR_test$Attrition))
cm_HR_DT_train <- confusionMatrix(factor(prediction_HR_DT_train), factor(HR_train$Attrition))
prediction_HR_RF_test <- predict(HR_RF_train, HR_test, type="raw")
prediction_HR_RF_train <- predict(HR_RF_train, HR_train, type="raw")
cm_HR_RF_test <- confusionMatrix(factor(make.names(prediction_HR_RF_test)), factor(make.names(HR_test$Attrition)))
cm_HR_RF_train <- confusionMatrix(factor(make.names(prediction_HR_RF_train)), factor(make.names(HR_train$Attrition)))
prediction_HR_SGB_test <- predict(HR_GBM_train, HR_test, type="raw")
prediction_HR_SGB_train <- predict(HR_GBM_train, HR_train, type="raw")
cm_HR_SGB_test <- confusionMatrix(factor(make.names(prediction_HR_SGB_test)), factor(make.names(HR_test$Attrition)))
cm_HR_SGB_train <- confusionMatrix(factor(make.names(prediction_HR_SGB_train)), factor(make.names(HR_train$Attrition)))
HR_test_performances <- data.table(HR_PRA=cm_HR_PRA_test$overall["Accuracy"], HR_DT=cm_HR_DT_test$overall["Accuracy"], HR_RF=cm_HR_RF_test$overall["Accuracy"], HR_SGB=cm_HR_SGB_test$overall["Accuracy"])
HR_train_performances <- data.table(HR_PRA=cm_HR_PRA_train$overall["Accuracy"], HR_DT=cm_HR_DT_train$overall["Accuracy"], HR_RF=cm_HR_RF_train$overall["Accuracy"], HR_SGB=cm_HR_SGB_train$overall["Accuracy"])
HR_test_performances
## HR_PRA HR_DT HR_RF HR_SGB
## 1: 0.8818182 0.8022727 0.8522727 0.8727273
## HR_PRA HR_DT HR_RF HR_SGB
## 1: 0.8708738 0.9087379 0.961165 0.915534
The first column gives the URL of the online news article and is present only for descriptive purposes, so I drop it before modeling. Some other variables were found to have no variation, or to be highly correlated with existing features. For example, the keyword information is already available in the dataset, so derived features such as its average, minimum or maximum add little. I removed those columns to clean the data.
news_data_drop <- select(news_data, -c("url", "timedelta", "n_non_stop_words", "kw_min_min", "kw_max_min", "kw_avg_min", "kw_min_max", "kw_max_max", "kw_avg_max", "kw_min_avg", "kw_max_avg", "kw_avg_avg", "min_positive_polarity", "max_positive_polarity", "min_negative_polarity", "max_negative_polarity"))
While exploring the dataset I saw that all variables, even the binary ones, were stored as numeric. So I filtered the variables with fewer than 5 unique values and converted them to factors.
## [1] "data_channel_is_lifestyle" "data_channel_is_entertainment"
## [3] "data_channel_is_bus" "data_channel_is_socmed"
## [5] "data_channel_is_tech" "data_channel_is_world"
## [7] "weekday_is_monday" "weekday_is_tuesday"
## [9] "weekday_is_wednesday" "weekday_is_thursday"
## [11] "weekday_is_friday" "weekday_is_saturday"
## [13] "weekday_is_sunday" "is_weekend"
news_data_fix <- news_data_drop %>%
  mutate_at(c("data_channel_is_lifestyle", "data_channel_is_entertainment", "data_channel_is_bus", "data_channel_is_socmed","data_channel_is_tech","data_channel_is_world","weekday_is_monday","weekday_is_tuesday","weekday_is_wednesday", "weekday_is_thursday","weekday_is_friday","weekday_is_saturday", "weekday_is_sunday", "is_weekend"), factor)
When I checked the data with skim() and str() I saw that there are no missing values. After converting the dummy encodings from numeric to factor, the data is clean.
median_shares<-median(news_data$shares)
news_data_binary <- news_data_fix %>%
mutate(is_popular=as.numeric(shares> median_shares))
news_data_binary2 <- news_data_binary %>%
mutate_at(c("is_popular"), factor)
str(news_data_binary2)
## Classes 'data.table' and 'data.frame': 39644 obs. of 46 variables:
## $ n_tokens_title : num 12 9 9 9 13 10 8 12 11 10 ...
## $ n_tokens_content : num 219 255 211 531 1072 ...
## $ n_unique_tokens : num 0.664 0.605 0.575 0.504 0.416 ...
## $ n_non_stop_unique_tokens : num 0.815 0.792 0.664 0.666 0.541 ...
## $ num_hrefs : num 4 3 3 9 19 2 21 20 2 4 ...
## $ num_self_hrefs : num 2 1 1 0 19 2 20 20 0 1 ...
## $ num_imgs : num 1 1 1 1 20 0 20 20 0 1 ...
## $ num_videos : num 0 0 0 0 0 0 0 0 0 1 ...
## $ average_token_length : num 4.68 4.91 4.39 4.4 4.68 ...
## $ num_keywords : num 5 4 6 7 7 9 10 9 7 5 ...
## $ data_channel_is_lifestyle : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ data_channel_is_entertainment: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 1 1 1 ...
## $ data_channel_is_bus : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
## $ data_channel_is_socmed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ data_channel_is_tech : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 2 2 1 ...
## $ data_channel_is_world : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
## $ self_reference_min_shares : num 496 0 918 0 545 8500 545 545 0 0 ...
## $ self_reference_max_shares : num 496 0 918 0 16000 8500 16000 16000 0 0 ...
## $ self_reference_avg_sharess : num 496 0 918 0 3151 ...
## $ weekday_is_monday : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ weekday_is_tuesday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_wednesday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_thursday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_friday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_saturday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_sunday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ is_weekend : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ LDA_00 : num 0.5003 0.7998 0.2178 0.0286 0.0286 ...
## $ LDA_01 : num 0.3783 0.05 0.0333 0.4193 0.0288 ...
## $ LDA_02 : num 0.04 0.0501 0.0334 0.4947 0.0286 ...
## $ LDA_03 : num 0.0413 0.0501 0.0333 0.0289 0.0286 ...
## $ LDA_04 : num 0.0401 0.05 0.6822 0.0286 0.8854 ...
## $ global_subjectivity : num 0.522 0.341 0.702 0.43 0.514 ...
## $ global_sentiment_polarity : num 0.0926 0.1489 0.3233 0.1007 0.281 ...
## $ global_rate_positive_words : num 0.0457 0.0431 0.0569 0.0414 0.0746 ...
## $ global_rate_negative_words : num 0.0137 0.01569 0.00948 0.02072 0.01213 ...
## $ rate_positive_words : num 0.769 0.733 0.857 0.667 0.86 ...
## $ rate_negative_words : num 0.231 0.267 0.143 0.333 0.14 ...
## $ avg_positive_polarity : num 0.379 0.287 0.496 0.386 0.411 ...
## $ avg_negative_polarity : num -0.35 -0.119 -0.467 -0.37 -0.22 ...
## $ title_subjectivity : num 0.5 0 0 0 0.455 ...
## $ title_sentiment_polarity : num -0.188 0 0 0 0.136 ...
## $ abs_title_subjectivity : num 0 0.5 0.5 0.5 0.0455 ...
## $ abs_title_sentiment_polarity : num 0.188 0 0 0 0.136 ...
## $ shares : int 593 711 1500 1200 505 855 556 891 3600 710 ...
## $ is_popular : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 2 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
| Name | news_data_binary2 |
| Number of rows | 39644 |
| Number of columns | 46 |
| _______________________ | |
| Column type frequency: | |
| factor | 15 |
| numeric | 31 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| data_channel_is_lifestyle | 0 | 1 | FALSE | 2 | 0: 37545, 1: 2099 |
| data_channel_is_entertainment | 0 | 1 | FALSE | 2 | 0: 32587, 1: 7057 |
| data_channel_is_bus | 0 | 1 | FALSE | 2 | 0: 33386, 1: 6258 |
| data_channel_is_socmed | 0 | 1 | FALSE | 2 | 0: 37321, 1: 2323 |
| data_channel_is_tech | 0 | 1 | FALSE | 2 | 0: 32298, 1: 7346 |
| data_channel_is_world | 0 | 1 | FALSE | 2 | 0: 31217, 1: 8427 |
| weekday_is_monday | 0 | 1 | FALSE | 2 | 0: 32983, 1: 6661 |
| weekday_is_tuesday | 0 | 1 | FALSE | 2 | 0: 32254, 1: 7390 |
| weekday_is_wednesday | 0 | 1 | FALSE | 2 | 0: 32209, 1: 7435 |
| weekday_is_thursday | 0 | 1 | FALSE | 2 | 0: 32377, 1: 7267 |
| weekday_is_friday | 0 | 1 | FALSE | 2 | 0: 33943, 1: 5701 |
| weekday_is_saturday | 0 | 1 | FALSE | 2 | 0: 37191, 1: 2453 |
| weekday_is_sunday | 0 | 1 | FALSE | 2 | 0: 36907, 1: 2737 |
| is_weekend | 0 | 1 | FALSE | 2 | 0: 34454, 1: 5190 |
| is_popular | 0 | 1 | FALSE | 2 | 0: 20082, 1: 19562 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| n_tokens_title | 0 | 1 | 10.40 | 2.11 | 2.00 | 9.00 | 10.00 | 12.00 | 23.00 | ▁▇▇▁▁ |
| n_tokens_content | 0 | 1 | 546.51 | 471.11 | 0.00 | 246.00 | 409.00 | 716.00 | 8474.00 | ▇▁▁▁▁ |
| n_unique_tokens | 0 | 1 | 0.55 | 3.52 | 0.00 | 0.47 | 0.54 | 0.61 | 701.00 | ▇▁▁▁▁ |
| n_non_stop_unique_tokens | 0 | 1 | 0.69 | 3.26 | 0.00 | 0.63 | 0.69 | 0.75 | 650.00 | ▇▁▁▁▁ |
| num_hrefs | 0 | 1 | 10.88 | 11.33 | 0.00 | 4.00 | 8.00 | 14.00 | 304.00 | ▇▁▁▁▁ |
| num_self_hrefs | 0 | 1 | 3.29 | 3.86 | 0.00 | 1.00 | 3.00 | 4.00 | 116.00 | ▇▁▁▁▁ |
| num_imgs | 0 | 1 | 4.54 | 8.31 | 0.00 | 1.00 | 1.00 | 4.00 | 128.00 | ▇▁▁▁▁ |
| num_videos | 0 | 1 | 1.25 | 4.11 | 0.00 | 0.00 | 0.00 | 1.00 | 91.00 | ▇▁▁▁▁ |
| average_token_length | 0 | 1 | 4.55 | 0.84 | 0.00 | 4.48 | 4.66 | 4.85 | 8.04 | ▁▁▇▃▁ |
| num_keywords | 0 | 1 | 7.22 | 1.91 | 1.00 | 6.00 | 7.00 | 9.00 | 10.00 | ▁▂▇▇▇ |
| self_reference_min_shares | 0 | 1 | 3998.76 | 19738.67 | 0.00 | 639.00 | 1200.00 | 2600.00 | 843300.00 | ▇▁▁▁▁ |
| self_reference_max_shares | 0 | 1 | 10329.21 | 41027.58 | 0.00 | 1100.00 | 2800.00 | 8000.00 | 843300.00 | ▇▁▁▁▁ |
| self_reference_avg_sharess | 0 | 1 | 6401.70 | 24211.33 | 0.00 | 981.19 | 2200.00 | 5200.00 | 843300.00 | ▇▁▁▁▁ |
| LDA_00 | 0 | 1 | 0.18 | 0.26 | 0.00 | 0.03 | 0.03 | 0.24 | 0.93 | ▇▁▁▁▁ |
| LDA_01 | 0 | 1 | 0.14 | 0.22 | 0.00 | 0.03 | 0.03 | 0.15 | 0.93 | ▇▁▁▁▁ |
| LDA_02 | 0 | 1 | 0.22 | 0.28 | 0.00 | 0.03 | 0.04 | 0.33 | 0.92 | ▇▁▁▁▁ |
| LDA_03 | 0 | 1 | 0.22 | 0.30 | 0.00 | 0.03 | 0.04 | 0.38 | 0.93 | ▇▁▁▁▂ |
| LDA_04 | 0 | 1 | 0.23 | 0.29 | 0.00 | 0.03 | 0.04 | 0.40 | 0.93 | ▇▂▁▁▂ |
| global_subjectivity | 0 | 1 | 0.44 | 0.12 | 0.00 | 0.40 | 0.45 | 0.51 | 1.00 | ▁▃▇▁▁ |
| global_sentiment_polarity | 0 | 1 | 0.12 | 0.10 | -0.39 | 0.06 | 0.12 | 0.18 | 0.73 | ▁▂▇▁▁ |
| global_rate_positive_words | 0 | 1 | 0.04 | 0.02 | 0.00 | 0.03 | 0.04 | 0.05 | 0.16 | ▅▇▁▁▁ |
| global_rate_negative_words | 0 | 1 | 0.02 | 0.01 | 0.00 | 0.01 | 0.02 | 0.02 | 0.18 | ▇▁▁▁▁ |
| rate_positive_words | 0 | 1 | 0.68 | 0.19 | 0.00 | 0.60 | 0.71 | 0.80 | 1.00 | ▁▁▃▇▃ |
| rate_negative_words | 0 | 1 | 0.29 | 0.16 | 0.00 | 0.19 | 0.28 | 0.38 | 1.00 | ▅▇▃▁▁ |
| avg_positive_polarity | 0 | 1 | 0.35 | 0.10 | 0.00 | 0.31 | 0.36 | 0.41 | 1.00 | ▁▇▃▁▁ |
| avg_negative_polarity | 0 | 1 | -0.26 | 0.13 | -1.00 | -0.33 | -0.25 | -0.19 | 0.00 | ▁▁▂▇▃ |
| title_subjectivity | 0 | 1 | 0.28 | 0.32 | 0.00 | 0.00 | 0.15 | 0.50 | 1.00 | ▇▂▂▁▂ |
| title_sentiment_polarity | 0 | 1 | 0.07 | 0.27 | -1.00 | 0.00 | 0.00 | 0.15 | 1.00 | ▁▁▇▂▁ |
| abs_title_subjectivity | 0 | 1 | 0.34 | 0.19 | 0.00 | 0.17 | 0.50 | 0.50 | 0.50 | ▃▂▁▁▇ |
| abs_title_sentiment_polarity | 0 | 1 | 0.16 | 0.23 | 0.00 | 0.00 | 0.00 | 0.25 | 1.00 | ▇▂▁▁▁ |
| shares | 0 | 1 | 3395.38 | 11626.95 | 1.00 | 946.00 | 1400.00 | 2800.00 | 843300.00 | ▇▁▁▁▁ |
Popularity (shares) is a continuous target in the original dataset, as can be seen in its histogram. I split it at the median and modeled the problem as binary classification: the data contains many binary dummy encodings and a large number of features, and modeling shares as a regression target did not work well. As the barplot shows, the median split leaves no class imbalance.
There are over 40,000 observations in the news dataset and building models on all of them took too long, so I sampled 4,000 observations. After sampling, I split the data into 70% for training and 30% for testing.
news_data_omit <- na.omit(news_data_binary2)
news_data_omit2 <- select(news_data_omit, -c("shares"))
set.seed(123)  # set the seed before sampling so the subsample is reproducible
news_data_omit2 = news_data_omit2[sample(nrow(news_data_omit2), 4000), ]
split_news <- createDataPartition(news_data_omit2$`is_popular`, p = .7,
                                  list = FALSE,
                                  times = 1)
news_train <- news_data_omit2[split_news,]
news_test <- news_data_omit2[-split_news,]
The penalized regression approach requires standardization, but the data contains binary variables, so I standardized only the numerical columns of the train and test sets. (Ideally the test set would be scaled with the training set's means and standard deviations; here each set is scaled separately.) The tree-based approaches do not need the standardized versions.
news_train_standard <- news_train %>%
mutate_if(is.numeric, scale)
news_test_standard <- news_test %>%
  mutate_if(is.numeric, scale)
news_train_matrix = matrix()
news_test_matrix = matrix()
news_target = matrix()
news_target_train <- data.matrix(news_train_standard$is_popular)
news_target_test <- data.matrix(news_test_standard$is_popular)
news_train_matrix <- data.matrix(news_train_standard[,-c("is_popular")])
news_test_matrix <- data.matrix(news_test_standard[,-c("is_popular")])
news_PRA_cv = cv.glmnet(news_train_matrix, news_target_train, family="binomial")
options(repr.plot.width=5, repr.plot.height=5)
plot(news_PRA_cv)
## [1] 0.006113813
## [1] 0.01550074
news_PRA <- glmnet(news_train_matrix, news_target_train ,family="binomial", alpha=1,lambda=news_PRA_cv$lambda.min)
summary(news_PRA)
## Length Class Mode
## a0 1 -none- numeric
## beta 44 dgCMatrix S4
## df 1 -none- numeric
## dim 2 -none- numeric
## lambda 1 -none- numeric
## dev.ratio 1 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## classnames 2 -none- character
## call 6 -none- call
## nobs 1 -none- numeric
Now we can calculate accuracy for the train and test data, i.e. the proportion of correctly classified observations (since this is a classification task, accuracy rather than a sum of squared errors is the appropriate metric).
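Accuracy here is simply the share of predictions that match the true labels. A toy illustration with hypothetical label vectors:

```r
# Toy example: accuracy as the fraction of matching labels
pred  <- factor(c("0", "1", "1", "0"))
truth <- factor(c("0", "1", "0", "0"))
mean(pred == truth)   # 0.75: three of four predictions are correct
```

confusionMatrix() from caret reports the same quantity in its overall["Accuracy"] entry, along with the full confusion table.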
news_tune_grid = expand.grid(min_leaf_obs=seq(3,10,1),complexity=seq(0,0.3,0.1))
news_folds <- createFolds(1:nrow(news_train), k = 10)
news_cv_data = news_train
news_cv_param = tibble()
news_all_cv_stat = data.table()
for(p in 1:nrow(news_tune_grid)) {
news_temp_param = news_tune_grid[p,]
news_temp_result = tibble()
for(i in 1:10){
news_temp_test = news_cv_data[news_folds[[i]],]
news_temp_train = news_cv_data[-news_folds[[i]],]
news_temp_fit=rpart(is_popular~.,data = news_temp_train, method="class", control = rpart.control(minbucket = news_temp_param$min_leaf_obs,cp=news_temp_param$complexity))
news_temp_test$Prediction = predict(news_temp_fit, news_temp_test, type="class")
news_temp_result=rbind( news_temp_result, news_temp_test)
}
news_temp_stat = data.table(news_temp_param, Accuracy = sum(news_temp_result$Prediction == news_temp_result$is_popular) / nrow(news_temp_result))  # accuracy over all folds, not just the last one
print(news_temp_stat)
news_all_cv_stat = rbind(news_all_cv_stat, news_temp_stat)
}
## min_leaf_obs complexity Accuracy
## 1: 3 0 0.5785714
## min_leaf_obs complexity Accuracy
## 1: 4 0 0.5857143
## min_leaf_obs complexity Accuracy
## 1: 5 0 0.5928571
## min_leaf_obs complexity Accuracy
## 1: 6 0 0.575
## min_leaf_obs complexity Accuracy
## 1: 7 0 0.5428571
## min_leaf_obs complexity Accuracy
## 1: 8 0 0.5821429
## min_leaf_obs complexity Accuracy
## 1: 9 0 0.575
## min_leaf_obs complexity Accuracy
## 1: 10 0 0.5642857
## min_leaf_obs complexity Accuracy
## 1: 3 0.1 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 4 0.1 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 5 0.1 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 6 0.1 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 7 0.1 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 8 0.1 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 9 0.1 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 10 0.1 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 3 0.2 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 4 0.2 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 5 0.2 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 6 0.2 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 7 0.2 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 8 0.2 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 9 0.2 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 10 0.2 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 3 0.3 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 4 0.3 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 5 0.3 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 6 0.3 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 7 0.3 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 8 0.3 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 9 0.3 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 10 0.3 0.4821429
## min_leaf_obs complexity Accuracy
## 1: 5 0 0.5928571
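The CART call below refers to `news_best_DT`, which can be obtained by taking the highest-accuracy row of the cross-validation table `news_all_cv_stat`. A minimal sketch:

```r
# Pick the parameter combination with the highest cross-validated accuracy
news_best_DT <- news_all_cv_stat[which.max(Accuracy)]
print(news_best_DT)
```

With the results above this selects min_leaf_obs = 5 and complexity = 0.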
I then build the CART tree with these best-performing parameters.
news_DT = rpart(is_popular~.,data = news_train, method="class",control = rpart.control(minbucket = news_best_DT$min_leaf_obs,cp=news_best_DT$complexity))
fancyRpartPlot(news_DT)
For RF we tune only m (mtry), fixing the other parameters at J = 500 trees and a minimum of 5 observations per terminal node.
A common rule of thumb sets mtry to the square root of the number of predictors. With about 44 predictors that suggests a value near 7, so I cross-validate over 7, 8, and 9.
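The rule of thumb can be checked directly (44 is the predictor count reported for news_train after dropping the target):

```r
p <- 44          # number of predictors in news_train
round(sqrt(p))   # 7 -> anchors the search grid c(7, 8, 9)
```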
set.seed(123)
news_m=c(7,8,9)
J=500
min_size_of_terminal_nodes=5
n_folds=10
news_RF_fitControl=trainControl(method = "repeatedcv",
number = n_folds,
repeats = 1, classProbs = TRUE, search="random")
news_RF_grid=expand.grid(mtry = news_m)
news_RF_train <- train(make.names(as.factor(is_popular))~., data = news_train, method = "rf", trControl = news_RF_fitControl, ntree=J, nodesize = min_size_of_terminal_nodes, tuneGrid = news_RF_grid)
print(news_RF_train[["results"]])
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 7 0.6226398 0.2452905 0.02918176 0.05835767
## 2 8 0.6187112 0.2374175 0.02616970 0.05234155
## 3 9 0.6183516 0.2367177 0.02531512 0.05062853
set.seed(123)
news_GBM_control <- trainControl(method='cv', number=5, search='grid', summaryFunction = twoClassSummary, classProbs = T)
news_GBM_grid <- expand.grid(interaction.depth = c(2, 3, 5), n.trees = c(50,100), shrinkage = c(0.05,0.1), n.minobsinnode = 10)
print(news_GBM_grid)
## interaction.depth n.trees shrinkage n.minobsinnode
## 1 2 50 0.05 10
## 2 3 50 0.05 10
## 3 5 50 0.05 10
## 4 2 100 0.05 10
## 5 3 100 0.05 10
## 6 5 100 0.05 10
## 7 2 50 0.10 10
## 8 3 50 0.10 10
## 9 5 50 0.10 10
## 10 2 100 0.10 10
## 11 3 100 0.10 10
## 12 5 100 0.10 10
set.seed(123)
news_GBM_train <- train(make.names(as.factor(is_popular))~.,
data = news_train,
method = 'gbm',
metric = 'ROC',
trControl=news_GBM_control,
tuneGrid = news_GBM_grid,
verbose=FALSE)
print(news_GBM_train)
## Stochastic Gradient Boosting
##
## 2801 samples
## 44 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 2241, 2241, 2241, 2241, 2240
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees ROC Sens Spec
## 0.05 2 50 0.6740543 0.6171429 0.6224021
## 0.05 2 100 0.6765625 0.6221429 0.6245653
## 0.05 3 50 0.6778436 0.6014286 0.6323894
## 0.05 3 100 0.6794994 0.6142857 0.6331291
## 0.05 5 50 0.6782237 0.6142857 0.6295348
## 0.05 5 100 0.6827042 0.6092857 0.6252745
## 0.10 2 50 0.6808680 0.6285714 0.6259736
## 0.10 2 100 0.6798299 0.6235714 0.6231342
## 0.10 3 50 0.6765526 0.6221429 0.6188612
## 0.10 3 100 0.6746729 0.6278571 0.6088409
## 0.10 5 50 0.6710527 0.6100000 0.6117184
## 0.10 5 100 0.6681641 0.6192857 0.6088612
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
## 5, shrinkage = 0.05 and n.minobsinnode = 10.
The relative influence of each variable in the final GBM model:
## var rel.inf
## self_reference_min_shares self_reference_min_shares 8.3351084
## is_weekend1 is_weekend1 7.7612017
## LDA_02 LDA_02 7.1272192
## LDA_01 LDA_01 5.0884619
## average_token_length average_token_length 4.9510496
## num_hrefs num_hrefs 4.9038212
## global_sentiment_polarity global_sentiment_polarity 4.4833163
## LDA_00 LDA_00 4.0990566
## data_channel_is_entertainment1 data_channel_is_entertainment1 4.0989450
## n_non_stop_unique_tokens n_non_stop_unique_tokens 4.0747828
## self_reference_avg_sharess self_reference_avg_sharess 3.8403526
## global_rate_positive_words global_rate_positive_words 3.6384221
## data_channel_is_world1 data_channel_is_world1 3.0886016
## global_subjectivity global_subjectivity 3.0449110
## data_channel_is_socmed1 data_channel_is_socmed1 2.9876292
## n_unique_tokens n_unique_tokens 2.5487488
## rate_negative_words rate_negative_words 2.2435258
## LDA_03 LDA_03 2.2270708
## avg_negative_polarity avg_negative_polarity 2.1838811
## num_imgs num_imgs 2.1811247
## n_tokens_content n_tokens_content 2.1016394
## title_sentiment_polarity title_sentiment_polarity 1.8170108
## self_reference_max_shares self_reference_max_shares 1.6757440
## abs_title_subjectivity abs_title_subjectivity 1.5563534
## LDA_04 LDA_04 1.5230991
## rate_positive_words rate_positive_words 1.2361337
## avg_positive_polarity avg_positive_polarity 1.2054783
## global_rate_negative_words global_rate_negative_words 1.1606496
## n_tokens_title n_tokens_title 1.1451287
## title_subjectivity title_subjectivity 0.7022705
## num_keywords num_keywords 0.6062751
## abs_title_sentiment_polarity abs_title_sentiment_polarity 0.5167960
## num_videos num_videos 0.4054380
## num_self_hrefs num_self_hrefs 0.3947495
## weekday_is_friday1 weekday_is_friday1 0.2863353
## weekday_is_sunday1 weekday_is_sunday1 0.1890140
## weekday_is_tuesday1 weekday_is_tuesday1 0.1665105
## weekday_is_wednesday1 weekday_is_wednesday1 0.1487118
## data_channel_is_tech1 data_channel_is_tech1 0.1280160
## weekday_is_thursday1 weekday_is_thursday1 0.1274161
## data_channel_is_lifestyle1 data_channel_is_lifestyle1 0.0000000
## data_channel_is_bus1 data_channel_is_bus1 0.0000000
## weekday_is_monday1 weekday_is_monday1 0.0000000
## weekday_is_saturday1 weekday_is_saturday1 0.0000000
prediction_news_PRA_test <- predict(news_PRA, newx=news_test_matrix, s=c("lambda.min"), type="class")
prediction_news_PRA_train <- predict(news_PRA, newx=news_train_matrix, s=c("lambda.min"), type="class")
cm_news_PRA_test <- confusionMatrix(factor(prediction_news_PRA_test), factor(news_test$is_popular))
cm_news_PRA_train <- confusionMatrix(factor(prediction_news_PRA_train), factor(news_train$is_popular))
prediction_news_DT_test <- predict(news_DT, news_test, type="class")
prediction_news_DT_train <- predict(news_DT, news_train, type="class")
cm_news_DT_test <- confusionMatrix(factor(prediction_news_DT_test), factor(news_test$is_popular))
cm_news_DT_train <- confusionMatrix(factor(prediction_news_DT_train), factor(news_train$is_popular))
prediction_news_RF_test <- predict(news_RF_train, news_test, type="raw")
prediction_news_RF_train <- predict(news_RF_train, news_train, type="raw")
cm_news_RF_test <- confusionMatrix(factor(prediction_news_RF_test), factor(make.names(news_test$is_popular)))
cm_news_RF_train <- confusionMatrix(factor(prediction_news_RF_train), factor(make.names(news_train$is_popular)))
prediction_news_SGB_test <- predict(news_GBM_train, news_test, type="raw")
prediction_news_SGB_train <- predict(news_GBM_train, news_train, type="raw")
cm_news_SGB_test <- confusionMatrix(factor(prediction_news_SGB_test), factor(make.names(news_test$is_popular)))
cm_news_SGB_train <- confusionMatrix(factor(prediction_news_SGB_train), factor(make.names(news_train$is_popular)))
news_test_performances <- data.table(news_PRA_test=cm_news_PRA_test$overall["Accuracy"], news_DT_test=cm_news_DT_test$overall["Accuracy"], news_RF_test=cm_news_RF_test$overall["Accuracy"], news_SGB_test=cm_news_SGB_test$overall["Accuracy"])
news_train_performances <- data.table(news_PRA_train=cm_news_PRA_train$overall["Accuracy"], news_DT_train=cm_news_DT_train$overall["Accuracy"], news_RF_train=cm_news_RF_train$overall["Accuracy"], news_SGB_train=cm_news_SGB_train$overall["Accuracy"])
news_test_performances
## news_PRA_test news_DT_test news_RF_test news_SGB_test
## 1: 0.616347 0.5629691 0.6422018 0.64804
news_train_performances
## news_PRA_train news_DT_train news_RF_train news_SGB_train
## 1: 0.6301321 0.8543377 0.998929 0.7065334
Based on the results for the 4 datasets, my interpretations are as follows: